From VinKo to AlpiLinK: web-based long-term storage and accessibility of information







 

Abstract

In this talk, we will discuss challenges, solutions and open questions connected with the reusability and therefore persistence of web-based research from the perspective of the AlpiLinK project (Rabanus et al. 2022ff). AlpiLinK (Alpine Languages in Contact, <https://www.alpilink.it>, a ‘project of relevant national interest’ [PRIN] financed by the Italian Research Ministry) aims at the collection of oral linguistic data from the Germanic and Romance non-standard and minority languages spoken in alpine Italy. It collects its data online and intends to provide easy access to the collected data and to basic information regarding the varieties and the project itself not just for an academic audience, but also for a general public, including but not limited to the local speech communities (e.g., public-outreach activities involving local high schools like VinKiamo <https://sites.hss.univr.it/vinkiamo/>).

While most scientific results continue to be published on paper or in e-journals with paper-like procedures and standards, long-term storage and accessibility of online data and information for the general public are still in their infancy. AlpiLinK aims to maximize the probability of survival of its data by ensuring the data is stored in a structured format, both machine- and human-readable, in a CLARIN repository, like related datasets such as VinKo (Rabanus et al. 2022) and AThEME (Tomaselli et al. 2022). However, the permanent accessibility of the additional information (web pages with sometimes detailed, multimodal and even interactive speech-variety profiles, comparisons of linguistic features across language borders, and descriptions of language change) remains a challenge.

This became especially evident in the transition from our previous VinKo project to AlpiLinK. While all data is securely stored in the above-mentioned repository, it is an open question how long the alternative data representation via an interactive map (much appreciated by the general public) can be kept operational, and all additional information from the pages of the old, deactivated VinKo website is no longer accessible.

1. Introduction

This contribution discusses the challenges, solutions and open questions connected to the reusability and therefore persistence of web-based research from the perspective of the AlpiLinK project (Rabanus et al. 2022ff) and its predecessor, VinKo. 

While most scientific results continue to be published on paper or in e-journals with paper-like procedures and standards, long-term storage and accessibility of online data and information for the general public are still in their infancy. AlpiLinK aims to maximize the probability of survival of its data by ensuring the data is stored in a structured format, both machine- and human-readable, in a CLARIN repository, like related datasets such as VinKo (Rabanus et al. 2022; Rabanus/Kruijt/Tagliani/Tomaselli/Padovan/Alber/Cordin/Zamparelli/Vogt 2023) and AThEME (Tomaselli/Kruijt/Alber/Bidese/Casalicchio/Cordin/Kokkelmans/Padovan/Rabanus/Zuin 2022). However, the permanent accessibility of the additional information (web pages with sometimes detailed, multimodal and even interactive speech-variety profiles, comparisons of linguistic features across language borders, and descriptions of language change) remains a challenge.

This became especially evident in the transition from our previous VinKo project to AlpiLinK. While all data is securely stored in the above-mentioned repository, it is an open question how long the alternative data representation via an interactive map (much appreciated by the general public) can be kept operational, and all additional information from the pages of the old, deactivated VinKo website is no longer accessible. The data was collected with very specific target constructions or features in mind, but work on and with the data in other contexts, e.g. school projects, has shown that it contains plenty of valuable material that was not part of the initial targets. This paper elaborates and showcases the value of the FAIR treatment of the data for all stakeholders involved in the projects, using a case study on the lexical item 'ragazza' ('girl') and its realization across the project areas.

Include description of FAIR data standards and concrete aims of both projects to adhere to them where possible. Open Science.

An important part of the discussion is also which stakeholders exist, what would be useful to store, and in what format.

2. Projects

This section provides a brief description of each of the projects discussed in this paper and highlights where they are similar and where they differ. The VinKo project (Varieties in Contact, 2017-2023) was a collaboration of the University of Verona with the University of Trento and the Free University of Bozen-Bolzano. Its research focus was the documentation and description of the dialects and minority languages spoken in northeastern Italy (specifically Trentino-South Tyrol and Veneto). AlpiLinK (Alpine Languages in Contact, <https://www.alpilink.it>, a ‘project of relevant national interest’ [PRIN] financed by the Italian Research Ministry) is a joint project of the Universities of Verona, Trento, Turin and Valle d'Aosta and the Free University of Bozen-Bolzano. AlpiLinK's aims are "documentation, explanation and participation" in the collection of oral linguistic data from the Germanic, Romance and Slavic non-standard and minority languages spoken in alpine Italy, in order to investigate language contact.

2.1. The VinKo system

2.1.1. Data collection

VinKo's data collection was done via online crowdsourcing. It used an online linguistic questionnaire in which participants were asked to make audio recordings of their responses to a variety of tasks: (image-aided) translation, pronunciation and question-answer tasks. The topics of interest were phenomena of language contact and contact-induced language change in the areas of phonology, morphology and syntax. For this purpose, language-independent variables were defined and stimuli were constructed to elicit them. While some overlap was maintained, there were significant differences between the questionnaires offered for each language variety. For example, the translation tasks presented to Cimbrian speakers and to Venetan speakers partially overlapped, but also contained language-specific stimuli. The analysis of the data has yielded publications on the morphology of articles and pronouns (Kruijt 2022), subject expletives in weather verbs (Tomaselli/Bidese 2023), and expletive articles with personal names (Rabanus 2023). For a detailed description of the VinKo project, see Kruijt/Rabanus/Tagliani 2023.

In its six-year run, VinKo collected a large amount of data for the areas of Trentino-South Tyrol and Veneto. By the end of the project it had collected data from 1439 participants, with on average around 140 recordings available per speaker (ranging from single words to translated sentences).

Number of informants: 1439
Number of locations: 387
Number of audio files: 189,679
Languages: 12

VinKo data (Corpus 1.2)

2.1.2. Online systems

The VinKo website (taken offline in June 2023) formed the primary access point for users and researchers. It served as the main means of reaching and informing the general audience, as the gateway to the linguistic questionnaire and, via a restricted section, as the access point to the collected data. The technical implementation of the website was initiated at the University of Trento and later taken over by the technical staff of the University of Verona; at the end of the project the system was migrated entirely to the University of Verona. The system was built entirely in-house, using exclusively open, non-proprietary software packages (e.g. Leaflet for the map).

The public areas of the website were primarily used by the general public and in project communication aimed at a general audience. The map was especially appreciated: it gave participants visible confirmation of their participation (once a questionnaire was completed, the collected data was instantly available on the map), and it served as a very efficient way of illustrating the linguistic variety of the area to students and teachers involved in the school projects.

2.1.3. Data management

No data management plan (DMP) was in place at the start of the project. From the moment VinKo was taken on by the project of Excellence, there was a 5-year guarantee for data and software storage on the University server, but no long-term plan. During the registration phase, participants were presented with the data processing agreement of the project. It detailed the current and future uses of the supplied data, pointed out any potential risks to participants (e.g. identification on the basis of voice), and informed them of their rights regarding data retraction and deletion under the GDPR. The data was licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Italy (CC BY-NC-SA 3.0 IT).

Data storage: include a brief section on CLARIN's and EURAC's involvement in it (Rabanus/Kruijt/Tagliani/Tomaselli/Padovan/Alber/Cordin/Zamparelli/Vogt 2023).

2.1.4. Stakeholders

VinKo collected data via online crowdsourcing and as such had an important public-outreach aspect to its methodology. The website therefore aimed to present all collected data freely online and to provide easy access to the collected data and to basic information regarding the varieties and the project itself, not just for an academic audience but also for the general public, including but not limited to the local speech communities (e.g., public-outreach activities involving local high schools like VinKiamo <https://sites.hss.univr.it/vinkiamo/>). As such, the data is employed not only in scientific circles, but also for educational purposes outside of the university (Bertollo/Rabanus 2023).

Indicate the lessons learned during this project regarding longevity of data, online resources and reusability.

2.2. The AlpiLinK system

2.2.1. Data collection

The data collection of AlpiLinK is very similar to that of VinKo. It is done via online crowdsourcing and, building on the lessons learned in VinKo, the crowdsourcing is now supported by more frequent public communication and by school projects under the label VinKiamo across most regions involved in the project. It uses an online linguistic questionnaire in which participants are asked to make audio recordings of their responses to a variety of tasks (in combination with a few multiple-choice questions regarding name truncation). The tasks include translation, tense and word class transformation, and image description. The topics of interest are phenomena of language contact and contact-induced language change in the areas of phonology, morphology and syntax, e.g. pronominal clitics, article use with proper nouns, declension, negation, possession, and word formation. In addition to the brief sociolinguistic information collected from participants at the start, the questionnaire ends with an optional free section in which participants can volunteer more information about their linguistic profile, i.e. the linguistic aspects of their current life and upbringing, in any language of their choosing. These recordings do not yield identical information for all locations, as participants themselves decide what information they consider relevant and want to share. However, they provide a unique insight into the participants' lived linguistic experience and the linguistic situation on the ground, and are often quite revealing about language attitudes and prestige.

Apart from linguistic data, AlpiLinK also has a section dedicated to the linguistic landscape of the area. This data collection is done separately from the linguistic questionnaire and aims to collect photos of the use of language(s) in public space. The data is archived and stored within the Lingscape project at the University of Luxembourg.

In the first year of its data collection, AlpiLinK has collected a significant amount of data with a good distribution across the regions involved. Within a single year it has collected data from 907 participants, with on average 42 recordings available per speaker (primarily recordings of single sentences).

Number of informants: 813
Number of locations: 424
Number of audio files: 29,153
Languages: 15

AlpiLinK data (as of 28-03-2024)

2.2.2. Online systems

External proprietary software is primarily used for the creation of the website and for the data collection.

2.2.3. Data management

Data storage is also done via an external but open platform, Zenodo (check their project description). Perhaps also check the CLARIN network and involve the University Library to see whether it can store copies of the datasets (after finalization of the project) or at least provide pointers to the EURAC dataset. Mention the linguistic landscape aspect: archiving and integration are part of another project, which allows for better use of the expertise within each project; collaborative.

2.2.4. Stakeholders

In continuation of VinKo, local high schools form an important part of the community stakeholders. 

3. Transition from VinKo to AlpiLinK

During the transition from VinKo to AlpiLinK, the plan was to retain as much of the digital infrastructure as possible, but the main focus was on safeguarding the data. From the beginning it was clear that at least some of the features investigated in VinKo would be adopted by AlpiLinK, so for both research and accessibility it was essential that the VinKo data remained available.

From the feedback on the linguistic questionnaire from the school projects, it is clear that community stakeholders attach more importance to the lexicon than to grammatical features. For them, it is therefore more important to keep the lexical items and audio files than only abstract maps of the distribution of grammatical features. Neither project targets lexical items specifically, but they are nonetheless produced in the collection of the grammatical features. In VinKo, sentence T0303 was aimed at the collection of personal pronouns, but also elicited the word for 'girl'. In AlpiLinK, sentence I03 occasionally produces the word for 'girl' (it is a non-verbal cue, a picture of two girls, which also commonly elicits the lexical items 'twins', 'sisters' and 'women').
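
As a minimal sketch of how such incidental lexical data could be retrieved from the two collections, the following Python snippet selects recordings by project and stimulus identifier. The record layout (the keys `project`, `stimulus`, `location` and `audio`) and the sample rows are assumptions made for illustration and do not reflect the actual corpus schema; only the stimulus identifiers T0303 and I03 are taken from the projects.

```python
# Hypothetical records; the real corpora may be organised differently.
RECORDINGS = [
    {"project": "VinKo", "stimulus": "T0303", "location": "A", "audio": "a1.mp3"},
    {"project": "VinKo", "stimulus": "T0101", "location": "A", "audio": "a2.mp3"},
    {"project": "AlpiLinK", "stimulus": "I03", "location": "B", "audio": "b1.mp3"},
]

# Stimuli reported to elicit the word for 'girl'.
GIRL_STIMULI = {("VinKo", "T0303"), ("AlpiLinK", "I03")}


def select_by_stimulus(recordings, targets):
    """Return the recordings whose (project, stimulus) pair is in targets."""
    return [r for r in recordings if (r["project"], r["stimulus"]) in targets]


girl_candidates = select_by_stimulus(RECORDINGS, GIRL_STIMULI)
```

The result is only a list of candidates: since I03 is a picture stimulus that also elicits other lexical items, the AlpiLinK recordings would still need to be checked by listening.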

3.1. What was projected to remain accessible

In moving from one project to the next, the initial plan was to retain the same digital infrastructure, necessitating only changes in the content. However, the fact that the digital infrastructure was split across two departments of two different universities soon proved problematic. The in-house design of the initial set-up was also difficult to maintain and keep up to date. As it did not fit with the technical applications of other projects of the University, the time it required was much larger than its necessity justified. Once these practical issues became apparent, the approach was quickly changed: either fit in with the software used for other projects (e.g. WordPress) or find an outside company that could provide the same services (at a fraction of the cost of in-house development).

Scientific data and its uses for didactics: involvement and training of students in methods and practices of science; development of social and digital competences (check other competences in sheet drive); multilingual competences and validation of cultural heritage (check EU priorities here). Citizen Science. Data forms the basis for the practical application of all these concepts, in collection, in analysis, and in presentation.

From the beginning of AlpiLinK, a main goal has been to simplify the storage and archiving of the data in a way that allows simple integration with the VinKo data. With the current dataset, however basic, this already allows the creation of maps combining both datasets.
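
One way such a combined map could be prepared is sketched below: location records from both datasets are merged into a single GeoJSON feature collection, a format that mapping libraries such as Leaflet can read. The field names (`location`, `lat`, `lon`), the function names and the coordinates are placeholders chosen for illustration, not values or structures from the actual datasets.

```python
import json


def to_feature(record, project):
    """Convert one location record into a GeoJSON point feature."""
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [record["lon"], record["lat"]]},
        "properties": {"project": project, "location": record["location"]},
    }


def combine(vinko_rows, alpilink_rows):
    """Merge the location records of both projects into one feature collection."""
    features = [to_feature(r, "VinKo") for r in vinko_rows]
    features += [to_feature(r, "AlpiLinK") for r in alpilink_rows]
    return {"type": "FeatureCollection", "features": features}


# Placeholder rows with made-up coordinates.
vinko = [{"location": "Place A", "lat": 46.0, "lon": 11.0}]
alpilink = [{"location": "Place B", "lat": 45.5, "lon": 7.5}]

collection = combine(vinko, alpilink)
geojson_text = json.dumps(collection, ensure_ascii=False, indent=2)
```

Keeping the project name as a property of each point would allow a map to style or filter VinKo and AlpiLinK locations separately while still drawing them from one file.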

3.2. What doesn't remain accessible 

Project description, website, language descriptions, and any of the technical work that went into the development of the website or the online map (still available now, but certain to be taken offline soon).

4. Discussion: Persistence of web-based research

What is desirable to remain accessible? And what does this mean for the allocation of resources?

In our case the technical infrastructure was (or will be) largely lost, meaning that heavy investment in technical knowledge at an institutional level is not immediately rewarded or recognised. On a project basis this necessarily means a larger expenditure on technical expertise from external (primarily profit-driven) companies; on an institutional level this is easier to manage from a budget point of view, as it does not necessitate the employment of staff. This is short-sighted, however, and entails a lack of technical knowledge to inform projects in their initial stages.

User perspective: what formats are useful for whom? The general public is not easily helped by an online dataset; technical applications work best with a proper database; but regular linguistic researchers are not well served by a database (e.g. SQLite or similar) if they are unfamiliar with its workings. Spreadsheets are a way of not taking a stance while maintaining a low threshold for use, hopefully encouraging reuse. However, they do imply a bit more legwork before the data can be integrated into an online service.
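
The extra legwork can be illustrated with a small sketch, assuming a spreadsheet exported as CSV: before a service can query the data, the rows have to be loaded into a database such as SQLite. The column names, the table name `recordings` and the sample rows are hypothetical and serve only to show the step; Python's standard `csv` and `sqlite3` modules are used.

```python
import csv
import io
import sqlite3

# Stand-in for a spreadsheet exported as CSV; column names are hypothetical.
CSV_TEXT = """speaker_id,location,stimulus,audio_file
s001,Place A,T0303,s001_T0303.mp3
s002,Place B,T0303,s002_T0303.mp3
"""


def load_csv_into_sqlite(csv_text, connection):
    """Create a 'recordings' table and fill it from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    connection.execute(
        "CREATE TABLE recordings "
        "(speaker_id TEXT, location TEXT, stimulus TEXT, audio_file TEXT)"
    )
    # Named placeholders are filled from the dictionary keys of each row.
    connection.executemany(
        "INSERT INTO recordings "
        "VALUES (:speaker_id, :location, :stimulus, :audio_file)",
        rows,
    )
    connection.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
inserted = load_csv_into_sqlite(CSV_TEXT, conn)
locations = [
    row[0]
    for row in conn.execute(
        "SELECT location FROM recordings WHERE stimulus = ? ORDER BY location",
        ("T0303",),
    )
]
```

In this arrangement the spreadsheet remains the archived, human-readable source, and the database is a derived copy that can be rebuilt whenever an online service needs it.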

5. Test map

 

Bibliography

  • Bertollo/Rabanus 2023 = Bertollo, Sabrina / Rabanus, Stefan (2023): VinKiamo: ein Citizen-Science-Projekt für Schulen zur Förderung von (sprach-) übergreifenden Kompetenzen, in: Alsic, vol. 26, 1, 1-19.
  • Kruijt 2022 = Kruijt, Anne (2022): Crowdsourcing language contact: pronoun and article morphology in Trentino-South Tyrol and Veneto, Verona, University of Verona.
  • Kruijt/Rabanus/Tagliani 2023 = Kruijt, Anne / Rabanus, Stefan / Tagliani, Marta (2023): The VinKo-Corpus: Oral data from Romance and Germanic local varieties of Northern Italy, in: Kupietz, Marc / Schmidt, Thomas (Eds.), Neue Entwicklungen in der Korpuslandschaft der Germanistik: Beiträge zur IDS-Methodenmesse 2022. Korpuslinguistik und Interdisziplinäre Perspektiven auf Sprache - Corpus linguistics and Interdisciplinary perspectives on Language (CLIP), Narr Francke Attempto.
  • Rabanus 2023 = Rabanus, Stefan (2023): Nome di battesimo e articolo espletivo – crowdsourcing e cartografica linguistica nello studio della variazione linguistica in Trentino-Alto Adige e Veneto, in: Schöntag, Roger / Linzmeier, Laura (Eds.), Neue Ansätze und Perspektiven zur sprachlichen Raumkonzeption und Geolinguistik: Fallstudien aus der Romania und der Germania, Peter Lang.
  • Rabanus/Kruijt/Alber/Bidese/Gaeta/Raimondi 2023 = Rabanus, Stefan / Kruijt, Anne / Alber, Birgit / Bidese, Ermenegildo / Gaeta, Livio / Raimondi, Gianmario (2023): AlpiLinK. German-Romance language contact in the Italian Alps: documentation, explanation, participation. In collaboration with Paolo Benedetto Mas, Sabrina Bertollo, Jan Casalicchio, Raffaele Cioffi, Patrizia Cordin, Michele Cosentino, Silvia dal Negro, Alexander Glück, Joachim Kokkelmans, Andriano Murelli, Andrea Padovan, Aline Pons, Matteo Rivoira, Marta Tagliani, Caterina Saracco, Emily Siviero, Alessandra Tomaselli, Ruth Videsott, Alessandro Vietti & Barbara Vogt.
  • Rabanus/Kruijt/Tagliani/Tomaselli/Padovan/Alber/Cordin/Zamparelli/Vogt 2022 = Rabanus, Stefan / Kruijt, Anne / Tagliani, Marta / Tomaselli, Alessandra / Padovan, Andrea / Alber, Birgit / Cordin, Patrizia / Zamparelli, Roberto / Vogt, Barbara Maria (2022): VinKo (Varieties in Contact) Corpus v1.1, University of Verona, Eurac Research CLARIN Centre.
  • Rabanus/Kruijt/Tagliani/Tomaselli/Padovan/Alber/Cordin/Zamparelli/Vogt 2023 = Rabanus, Stefan / Kruijt, Anne / Tagliani, Marta / Tomaselli, Alessandra / Padovan, Andrea / Alber, Birgit / Cordin, Patrizia / Zamparelli, Roberto / Vogt, Barbara Maria (2023): VinKo (Varieties in Contact) Corpus v1.2, Eurac Research CLARIN Centre, University of Verona.
  • Tomaselli/Bidese 2023 = Tomaselli, Alessandra / Bidese, Ermenegildo (2023): Fortune and Decay of Lexical Expletives in Germanic and Romance along the Adige River, in: Languages, vol. 8, 1, 44, Multidisciplinary Digital Publishing Institute.
  • Tomaselli/Kruijt/Alber/Bidese/Casalicchio/Cordin/Kokkelmans/Padovan/Rabanus/Zuin 2022 = Tomaselli, Alessandra / Kruijt, Anne / Alber, Birgit / Bidese, Ermenegildo / Casalicchio, Jan / Cordin, Patrizia / Kokkelmans, Joachim / Padovan, Andrea / Rabanus, Stefan / Zuin, Francesco (2022): AThEME Verona-Trento Corpus, Eurac Research CLARIN Centre.
