VerbaAlpina and the Challenge of being FAIR

Thomas Krefeld | Stephan Lücke (LMU)

0. Preliminary remark

The current funding phase is devoted to developing a second major subject area, namely nature; it covers the designations of weather phenomena, landscape formations, fauna and flora. At the same time, much work was invested in the conception and practical implementation of reliable data management procedures for web-based research projects. Since the fundamental importance of this second topic does not yet seem to have been widely recognized in Romance and Italian geolinguistics, it has deliberately been placed in the foreground of this work report.

1. Science communication on the Internet

Research funds are limited; access to them must therefore be regulated competitively. But although research is in this respect subject to competition, it is above all collaborative in nature: progress is only possible on the basis of the knowledge already available. With regard to collaboration, which is fundamentally based on communication, the framework conditions have changed completely in the last 15 years: within a few years, a society has emerged that is explicitly referred to as a knowledge society, since it presupposes the permanent and ubiquitous availability of new media in the private and public spheres and thus practically unlimited access to knowledge of all kinds.

This thorough mediatization, however, affects not only the consumption of knowledge but also its generation through research, not least because it enables very broad, location-independent cooperation. Of course, researchers have not thereby reached the land of milk and honey, because the option of cooperation is by no means realised automatically. Rather, it requires the observance of some elementary rules, which have recently been summed up in the acronym FAIR, launched by an important initiative. The acronym identifies four fundamental ethical principles for science communication under the conditions of the new media. According to them, research data must be

Findable,
Accessible,
Interoperable (‚compatible‘),
Reusable.

The requirements of three of the four principles (F, A, R) concern both human readability and machine readability; they therefore apply to both human-machine-human communication and machine-machine communication. The fourth principle (I) applies only to the latter; it is, however, central to the progress of research in the virtual, media-based framework outlined here. It thus stands for the indispensability of the technological component and for the transformation of the READER into an interactive USER, who can be located on a continuum between highly specialized expert and complete layperson and who approaches the data not only with a reading eye but possibly with the intention of using it for his or her own research purposes and, to this end, of employing automated ‚harvest helpers‘ (so-called harvesting).

The operationalisation of the FAIR principles requires a complex interplay between researchers, i.e. de facto temporary and therefore more or less precarious project work, on the one hand, and institutions that can promise permanence, on the other; these are first and foremost the large libraries. The development of procedures for this very special type of cooperation is one of the current challenges of research and is referred to as research data management (RDM; German Forschungsdatenmanagement, FDM). These are important cornerstones of science communication on the Web, and they define the horizon of this contribution.

2. FAIRness in a web-based research environment

The VerbaAlpina (VA) project attempts to design research communication consistently according to the FAIR principles, in the sense outlined above. Five complementary and closely interwoven functional areas must be distinguished:

Documentation;
Publication;
Cooperation;
Data collection through crowdsourcing;
Research laboratory.
2.1 FAIRness of the publication
The whole Internet is nothing more than a huge publication machine; it is, however, essential to differentiate, because what is published there differs, in part considerably, from what is published under the media conditions of print. VerbaAlpina publishes

semantic content (dialect forms, analytical scientific text),
metadata,
software and code.

Without exception, stable data and text files are produced, since the entire platform (user interface and databases) is ‚frozen‘, i.e. versioned, every six months; in addition, there is a current working version (version xxx), which is still subject to change and should therefore not be cited. However, a new version does not replace the previous one but supplements it: all previous versions are retained, so that all citations and links, both within the project and from outside, remain resolvable.

It is also ensured that the versions are easy to find, as they are assigned a DOI by the LMU University Library (UB) (http://dx.doi.org/10.5282/verba-alpina); at the same time, VA as a whole is included in the library catalogues.

In the same way, all thematic text contributions published on the project page under the tabs Lexicon alpinum, Methodology and Contributions can be identified; they also receive a DOI and can therefore be cited directly (cf. e.g. Krefeld, T. / Lücke, S.: s.v. „butyru(m)“, in: VA-de 18/2, Lexicon alpinum, http://dx.doi.org/10.5282/verba-alpina?urlappend=%3Fpage_id%3D2374%26db%3D182%23B128).
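The structure of such a citable DOI link can be made explicit with a short sketch. It relies only on what is visible in the link quoted above: the project DOI is followed by a percent-encoded `urlappend` parameter that carries a query string and a fragment. Reading `db=182` as the frozen version 18/2 and `B128` as the identifier of the cited entry is an assumption on our part, and the function name is purely illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# Citation link quoted in the text above.
CITATION_URL = (
    "http://dx.doi.org/10.5282/verba-alpina"
    "?urlappend=%3Fpage_id%3D2374%26db%3D182%23B128"
)


def decode_citation_link(url: str) -> dict:
    """Split a DOI link with an `urlappend` parameter into its parts."""
    outer = urlsplit(url)
    doi = outer.path.lstrip("/")
    # parse_qs percent-decodes the value, giving "?page_id=2374&db=182#B128".
    appended = parse_qs(outer.query)["urlappend"][0]
    inner = urlsplit(appended)
    params = {key: values[0] for key, values in parse_qs(inner.query).items()}
    return {"doi": doi, "params": params, "fragment": inner.fragment}


if __name__ == "__main__":
    print(decode_citation_link(CITATION_URL))
```

The point of the sketch is that the DOI itself stays stable, while the appended part selects a page, a database version and an anchor within that version.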

A similar function is performed by the URN, which is registered with the German National Library in Frankfurt. Finally, the entire source code of VA, with all programmed tools, can be found and accessed on GitHub. Technically, the procedure is based on exporting all VA data to a repository of the UB (Open Data LMU), where metadata in the DataCite format is also assigned.

A major role in the design of the metadata schemata is played by authority data, which allow a clear and fine-grained identification of the research data. VA distinguishes between three data categories (or entities), each of which is assigned its own identifiers that can be retrieved together with the data: ‚concept‘, ‚morpholexical type‘ (see typing) and ‚municipality‘. This results in highly specific metadata (see the DataCite example for the concept SENNHÜTTE, which contains the VA identifier C1 as well as the onomasiological identifiers of the Wikidata project that are also available in VA: Q136689, Q27849269, Q2649726), which in principle allow individual data of defined object classes to be referenced uniquely across project boundaries.
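How a project-internal identifier and external authority identifiers can sit side by side in one metadata record may be sketched as follows. This is a simplified illustration, not the record actually exported by VA: the field names are modelled loosely on the DataCite JSON representation (`alternateIdentifiers`, `subjects`), the identifier type label is invented for the example, and only the values C1, Q136689, Q27849269 and Q2649726 come from the text above.

```python
import json

# Concept URI prefix used by Wikidata for its entities.
WIKIDATA_ENTITY_BASE = "http://www.wikidata.org/entity/"


def build_concept_metadata(name: str, va_id: str, wikidata_ids: list) -> dict:
    """Assemble a simplified, DataCite-like record for one VA concept."""
    return {
        "titles": [{"title": name}],
        # Project-internal identifier (the type label is hypothetical).
        "alternateIdentifiers": [
            {
                "alternateIdentifier": va_id,
                "alternateIdentifierType": "VerbaAlpina concept ID",
            }
        ],
        # External authority identifiers, one subject entry per Wikidata item.
        "subjects": [
            {
                "subject": qid,
                "subjectScheme": "Wikidata",
                "valueUri": WIKIDATA_ENTITY_BASE + qid,
            }
            for qid in wikidata_ids
        ],
    }


if __name__ == "__main__":
    record = build_concept_metadata(
        "SENNHÜTTE", "C1", ["Q136689", "Q27849269", "Q2649726"]
    )
    print(json.dumps(record, indent=2, ensure_ascii=False))
```

Keeping the internal and the external identifiers in separate fields means a harvester can match on the Wikidata identifiers without having to know anything about VA's own numbering.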

This export guarantees the accessibility and reusability of the data after the end of project funding. The data is exported via an API (cf. API documentation), which is publicly accessible on the Internet; it can also deliver output in other formats, enriched with metadata from basically any other standard, e.g. according to CLARIN-D. A rough overview of the emerging research data management (as of 11.3.2019) can be found in the following scheme:

Research data management between projects, institutions and the public

With regard to findability and accessibility, two basic remarks are still in order:

Since the scientific community has so far bindingly defined neither a standard metadata schema nor which institutions should monitor compliance with it and the long-term preservation of data and metadata, VA has decided on a flexible interface concept that allows the use of basically any metadata schema. VA is also involved in two current research projects dealing with this problem: the LRZ initiative GeRDI and the project „eHumanities – interdisciplinary“ funded by the Bavarian State Government. In the GeRDI project, data from very different disciplines are to be linked via metadata by defining common attributes (which is very simple and often useful in the case of geo- and chronoreferencing, for example).

Apart from the metadata referencing the specific project data, it is strongly recommended to use identifiers and authority data that are established outside the project, in order to ensure traceability and technical interoperability. VA has therefore recently started using the identifiers of Wikidata items. They provide references for extra-linguistic realities and concepts and thus a common frame of reference for many different languages; there are currently (14.12.2018) Wikipedia articles in 133 languages about the dairy product butter. These very different articles are linked by referencing the unique identifier of the corresponding Wikidata item (Q34172). A search engine that searches for this identifier is thus able to find all 133 corresponding terms and so, at least theoretically, also the numerous dialect forms documented in VA (1926 attestations). A similar system for designation types, i.e. for words (L-ID), is under construction at Wikidata.
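The retrieval logic described here, finding designations in different languages through one shared identifier, can be illustrated with a minimal sketch. The entries are a small hand-made sample (four standard-language words for butter and one German word for the concept SENNHÜTTE), not data taken from VA or Wikidata, and the function names are illustrative.

```python
from collections import defaultdict

# Illustrative sample: (language code, designation, Wikidata item identifier).
ENTRIES = [
    ("en", "butter", "Q34172"),
    ("de", "Butter", "Q34172"),
    ("it", "burro", "Q34172"),
    ("fr", "beurre", "Q34172"),
    ("de", "Sennhütte", "Q136689"),
]


def index_by_concept(entries):
    """Group designations by the external identifier they reference."""
    index = defaultdict(list)
    for language, form, qid in entries:
        index[qid].append((language, form))
    return dict(index)


def entity_uri(qid: str) -> str:
    """Return the Wikidata concept URI for an item identifier."""
    return "http://www.wikidata.org/entity/" + qid


if __name__ == "__main__":
    index = index_by_concept(ENTRIES)
    print(entity_uri("Q34172"), index["Q34172"])
```

Because the lookup key is the identifier and not the word, forms that share no spelling at all still end up in the same group; the same would hold for dialect forms, provided they carry the identifier.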
2.2 FAIRness of the documentation
VA documents dialect attestations from the three large European language families, insofar as they prove to be specifically Alpine in ethnolinguistic terms.

The material was transformed into a systematically structured database and annotated according to linguistic criteria (‚morpho-lexical types‘, ‚basic types‘) and non-linguistic criteria (‚concepts‘). In addition to machine-readable access via the API mentioned above, there is also human-readable access, presented very vividly on an interactive map; the Google Maps map currently still used for this purpose will soon be replaced by a map with improved functionality based on OpenStreetMap and the JavaScript library Leaflet, which has already been largely developed.

The data-structuring categories mentioned above act as filters on the map interface. Even on this user-friendly interface, which is especially suitable for laypersons, a simple but elementary reusability function has been implemented: every map that can be displayed can be shared with others in exactly the same form (with the corresponding zoom level, opened windows etc.) or integrated into publications, because clicking on a ‚share button‘ generates a sendable URL for the currently displayed map; thus the following link leads to a map of all dialectal designations of BUTTER available in VA.

The available language material comes from two sources. A smaller part was collected by the project itself using crowdsourcing (see 2.4 below). The largest part, however, was obtained from printed works or works intended for printing; it thus also includes forms that were made available to us, within the framework of partnership agreements, by projects not yet completed (see, for example, the point network of the Language Atlas of Upper Austria). Dictionary material is also taken into account, provided that the linguistic evidence can be georeferenced; this is the case with good dialect dictionaries such as the DRG or the VSI. In principle, every source can also be chronoreferenced, but this function has not yet been implemented.

Through retro-digitisation and the web presence, numerous dialect expressions, some of which lie ‚dormant‘ in publications that are difficult to access, are made easy to find (F), accessible (A), interoperable (I) and reusable (R) in a generally compatible way, because all available forms have a persistent identifier and will soon also be accessible via a Digital Object Identifier (DOI). Here is an example from the Language and Subject Atlas of Italy and Southern Switzerland, AIS (1928-1940).

VA thus produces FAIR output, so to speak. Most of the sources, i.e. the input, are however miles away from FAIRness. The reasons for this are partly technical, partly legal and, finally, partly commercial. As a rule, language atlases are accessible only as physical printed works; very few offer even the most elementary stage of digitisation, i.e. digital photos (scans), as does the AIS in the form of NavigAIS or the SDS with regard to its original material. Not a single older atlas has so far been prepared as a structured corpus that also allows the export of data. For the ALD, at least, such a solution could be found on the basis of a cooperation agreement; the printing of this atlas by Hans Goebl was based on a digital format which was not interoperable owing to the lack of content identifiers, but which proved to be machine-readable and, after certain adaptations, correspondingly reusable; all designations of relevant concepts therefore appear in VerbaAlpina (cf. the ALD location network and this example).

Findable Accessible Interoperable Reusable
human mach. human mach. human mach. mach. mach. human mach.
ALI – – – – – – – –
SDS + + + – – – + –
AIS + + + – – – + –
ALD + + + – – + + +
VA + + + + + + + + +

The situation is different, and much more complex, for georeferenced dictionaries. The recently released online version of the DRG is set up in such a way that each lemma is accessible as a digital object thanks to an identifier (A), for example bargia ‚Schopf‘. A machine export is not planned, however, and the technical possibility of referencing a lemma directly via a URL appears to be more of a technical by-product that arose more or less coincidentally during software development. In any case, no citation link is offered, and users are generally given no concrete indication of this possibility, so that its use is ultimately left to the ingenuity of the user.

A series of online dictionaries for two Ladin dialects, Badiot and Gherdëina, has recently been published (https://www.micura.it/de/woerterbuecher). All of them are the responsibility of the Ladin Cultural Institute in St. Martin in Thurn (Istitut Ladin Micurà de Rü), and all of them are obviously derived from publications in book form. They are based on two dictionaries each for German (Mischì, Giovanni, Wörterbuch deutsch – gadertalisch = Vocabolar todësch – ladin, San Martin de Tor 2001 [ISBN 88-8171-028-5]; id., Wörterbuch : Deutsch – Grödner-Ladinisch = Vocabuler : tudësch – ladin de Gherdëina, San Martin de Tor 2002 [ISBN 88-8171-033-1]) and Italian (Moling, Sara [eds.], Dizionario italiano – ladino Val Badia; Dizionar ladin Val Badia – talian, San Martin de Tor 2016 [ISBN 978-88-8171-120-8]; Forni, Marco [ed.], Dizionario italiano – ladino gardenese = Dizioner ladin de gherdëina – talian, San Martin de Tor 2013 [ISBN 978-88-8171-106-2]). Only the Italian dictionaries offer the bidirectional perspective (Italian⇔Badiot and Badiot⇔Italian); the German dictionaries are monodirectional, German⇒Badiot and German⇒Gherdëina respectively.

The lexical inventory of these works is now also available on the Internet, although the concrete digitization procedure and the structure of the underlying database are completely unclear. The division of the material into four separate book publications is also reflected, surprisingly and unnecessarily, in its presentation on the Internet: each dictionary has its own Internet portal. Evidently, the two portals for the German dictionaries were conceived and realized by different developers than the Italian ones.

Badiot⇔Italian (and vice versa): http://itavalbadia.ladinternet.it/
German⇒Badiot
Gherdëina⇔Italian: http://dizionario-italiano-gardenese.ladinternet.it/ (= http://forniita.ladinternet.it/)
German⇒Gherdëina
The Florentine company SmallCodes, which has been developing technical solutions in the field of (mainly northern) Italian dialect research for years, is responsible for the Internet portals of the Italian data. The developers of the portals of the German-Ladin dictionaries are not named.

In line with the division into four portals, the respective databases are not related to each other either. As a consequence, a search starting from the Italian designation lumaca (concept SNAIL) returns sgnech from Val Badia, but not the German designation Schnecke, which in turn is linked to sgnech via the German portal. The Gardena form snech, which is obviously closely related to sgnech, can only be reached by a separate search on the corresponding portal, even though the two portals (Italian⇔Badiot and Italian⇔Gherdëina) were developed by the same company. All this shows that the databases, although united under one institutional umbrella, are at least technically unrelated and are therefore not ‚interoperable‘ in the sense of the FAIR principles. This also applies to connections from outside: a direct reference to the morpholexical types sgnech, snech and lumaca just mentioned is technically not possible. The only exceptions are the German versions of the online dictionaries, which at least allow URL-based references to the German lemma (e.g. https://www.micura.it/de/woerterbuecher/vb/dl?q=Snail), but a reference to the Ladin types is not technically possible here either.

Equally deplorable is the poor findability of the morpholexical types collected in the dictionaries from outside the actual portals, i.e. via the Internet or via library catalogues. This is not the responsibility of the individual actors, however, but is due to the lack of aggregators that can link separately generated and administered databases with each other using suitable metadata schemata. Such structures are currently only in their development phase. One example is the „Generic Research Data Infrastructure“ (GeRDI), in which VerbaAlpina is involved as a partner and pilot project. In any case, the lack of findability means that the Ladin online dictionaries discussed here fail to satisfy a further postulate of the acronym FAIR, the „F“ (Findable). The same applies to the remaining FAIR requirements of accessibility (A: Accessible) and reusability (R: Reusable). Although access via the Internet is possible in principle, it is significantly restricted by the fact that the data can only be queried through manual form entries. A complete or at least partial export of the data on the basis of freely definable filters is apparently not possible. Nor, evidently, does an API exist, which would be an important precondition for machine processing, especially with a view to linking the data with congruent external data. The lack of an API also entails the lack of interoperability of the data. Finally, the reusability of the data is decisively limited, not least by the license model under which it is made available: copyright permits the use of the data only to a very limited extent, under German law essentially only within the limits of the right of quotation.

The Ladin online dictionaries discussed here therefore give the impression of contemporary web publications only at first glance. It is to be welcomed that the material is available on the Internet at all, and certain functions and concepts go beyond what conventional book publications are capable of. These include the presentation of sound recordings (which, again, cannot be linked to) and an onomasiological tool („galleria immagini“ in the Italian-Ladin modules; again not referenceable by URL), which makes the data accessible via clickable images. Ultimately, however, the web portals suffer from restrictions that are actually inherent only to the book, and it is obvious that the possibilities of the new media are not being used here with the necessary determination and consistency. From VerbaAlpina’s point of view, this is particularly regrettable because it is de facto impossible to link one’s own databases with those of the Ladin dictionaries, even in a selective (and often reciprocal) way.

The Niev Vocabulari sursilvan online does not offer this possibility either, so that interoperability is out of the question.

 

2.3 FAIRness of cooperation
VA is supported by numerous partner projects; the great potential of this cooperation is self-evident and hardly requires explanation. Nevertheless, the constructive perspective of multiple and complementary re-use by compatible partner projects will be illustrated by an example: within the framework of the Archivio lessicale dei dialetti trentini (ALTR), five printed dialect dictionaries from different valleys (from the period between 1955 and 1984) were brought together in a database. Thanks to a project partnership, VA was able to convert and import the relevant expressions so that they can now be displayed cartographically in the context of all Alpine dialects (cf. the following designation of a device for churning: smalzaia).

The project architecture and the corresponding software have already proved to be interoperable in cooperation: the Sicilian regional and specialised dictionary of Sottile 2002 could be re-used without difficulty and presented as an atlas (cf. the Atlante linguistico della Sicilia online, which has been expanded by the Sicilian partners since 2019). The atlas of Picardy in northern France and Belgium, which is currently being created, has also been based on the concept and technology of VA since autumn 2018 (cf. Verba Picardia).

2.4 FAIRness in crowdsourcing
Crowdsourcing procedures are aimed primarily, though not exclusively, at laypersons; they therefore require that the central data areas be intuitively findable and accessible for human users. The way the survey is designed brings the data into a structured and interoperable format that allows subsequent use. VA uses crowdsourcing in two ways. First, an aesthetically pleasing and easy-to-use data collection tool was programmed (Join in!); a tutorial for it was also posted on YouTube. Second, a Zooniverse site has just been set up to pass on at least part of the transcription work required for retro-digitisation to the crowd. Interoperability of the VA database is a prerequisite for this as well.

The survey tool was promoted through popular-science lectures in the continuing education of relevant occupational groups (on 20.4.2018, 26.2.2018, 7.10.2017) and was also well received by the mass media. The evaluation is interesting, as it shows that reports about the project are particularly effective on the Internet, where a link offers direct, so to speak intramedial, access: by far the strongest response came from a post on the website of Bayerischer Rundfunk (27.4.2018). So far, a total of 11486 dialect forms (as of 12.3.2019) have been contributed by 955 ‚crowders‘.

3. Current developments and perspectives
VerbaAlpina has been able to attract a significant and still growing number of project partners. As part of the efforts towards sustainability and reusability in the sense of the FAIR principles, a cooperation with the CLARIN-D Centre Leipzig has recently been established, the primary aim of which is to deposit the VerbaAlpina project data in its repository as well. Work is currently underway on the data transfer, which will take place via the recently released API of the VA project portal.

VerbaAlpina provides each project partner with its own MySQL database, which is operated on the same database cluster as the VerbaAlpina database. The type and extent of use of these databases vary greatly. It should be noted that lexical material from the Atlante linguistico della Sicilia (ALS) is currently being transferred systematically to the corresponding partner database (PVA_ALS). This is language data from the Madonie, a mountain range on the north coast of Sicily in which cattle and dairy farming have traditionally been practised. The data transferred to the partner database are also visualised automatically on the online portal operated by ITG, with its integrated interactive map (http://www.als-online.gwi.uni-muenchen.de/carta/). From an onomasiological perspective, the material of the ALS is largely congruent with the material collected by VerbaAlpina in the Alpine region and thus opens up extended possibilities for recognising supra-regional connections, as has already been done by way of example within the project (see Krefeld, T.: s.v. „tomme / toma (f.) (roa.)“, in: VerbaAlpina-de 18/2, Lexicon alpinum, https://doi.org/10.5282/verba-alpina?urlappend=%3Fpage_id%3D2374%26db%3D182%23L616). However, the logical connection between the data in PVA_ALS and the VA database poses a major challenge, which can at best be sketched out within VerbaAlpina. In practice, the two databases would be interconnected by mutually assigning the concepts and morpholexical types to a common, central authority data instance. In VerbaAlpina's view, this could serve as a model for the development of a universal lexicography that could make lexical connections visible across time and space (and not only those).
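The idea of connecting two databases through a common authority instance can be sketched in a few lines. Everything here is a hypothetical illustration rather than a description of the actual VA or PVA_ALS schema: the record layout, the local identifier `als-1` and the placeholder forms are invented, and only the pairing of the VA concept C1 with the Wikidata item Q136689 is taken from the DataCite example mentioned earlier in the text.

```python
from collections import defaultdict

# Hypothetical crosswalk: (database, local concept ID) -> central identifier.
CROSSWALK = {
    ("VA", "C1"): "Q136689",
    ("PVA_ALS", "als-1"): "Q136689",  # invented local ID for illustration
}

# Hypothetical records: (database, local concept ID, placeholder form).
RECORDS = [
    ("VA", "C1", "<VA form 1>"),
    ("VA", "C1", "<VA form 2>"),
    ("PVA_ALS", "als-1", "<ALS form 1>"),
    ("PVA_ALS", "als-9", "<ALS form 2>"),  # no central identifier assigned yet
]


def link_records(records, crosswalk):
    """Group forms from several databases under their central identifier.

    Returns the linked groups and the list of records that could not be
    linked because their local concept has no entry in the crosswalk.
    """
    linked = defaultdict(lambda: defaultdict(list))
    unlinked = []
    for database, local_id, form in records:
        central_id = crosswalk.get((database, local_id))
        if central_id is None:
            unlinked.append((database, local_id, form))
        else:
            linked[central_id][database].append(form)
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {cid: dict(groups) for cid, groups in linked.items()}, unlinked


if __name__ == "__main__":
    linked, unlinked = link_records(RECORDS, CROSSWALK)
    print(linked)
    print(unlinked)
```

The sketch also shows where the practical effort lies: neither database has to change its own identifiers, but every concept and type has to be assigned to the central instance before it becomes visible in the combined view.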