FAIRness principle implementation in the NavigAIS, AMDV and AIS Reloaded projects

Version:



Keywords: AIS, Atlante linguistico ed etnografico dell'Italia e della Svizzera meridionale, AIS navigator, NavigAIS, AIS Reloaded, AISr, AIS the Digital Turn, AISdt, AMDV, dialect, dialectology, FAIR principles, FAIR, geolinguistics, Italian dialects, linguistic atlas, linguistic geography, multimedia, Switzerland, Veneto dialects

Citation:
  1. Reference to the entire contribution:
    Graziano Tisato (2020): FAIRness principle implementation in the NavigAIS, AMDV and AIS Reloaded projects, Version 3 (09.05.2020, 17:59). In: Thomas Krefeld & Stephan Lücke & Christina Mutter (Eds.) (2020): Berichte aus der digitalen Geolinguistik (II), Version 5. In: Korpus im Text, url: http://www.kit.gwi.uni-muenchen.de/?p=57577&v=3
    This URL contains a reference to the immutable version (…v=nn)
  2. Reference to a section, or source for a quotation: http://www.kit.gwi.uni-muenchen.de/?p=57577&v=3#p:1
    This URL contains a reference to a specific section (…p:1). In this case it is linked to the first paragraph. A complete reference can be obtained from the numbers in the grey squares at the right-hand margin of the text area.
Abstract

The paper presents the methodology, the development, the results, and the FAIRness principle implementation of three related projects:

NavigAIS (2009-2017), a high-resolution navigable version of the AIS, the Linguistic and Ethnographic Atlas of Italy and Southern Switzerland (Sprach- und Sachatlas Italiens und der Südschweiz) (Karl Jaberg and Jakob Jud, 1928-1940), created to allow the exploration of the 1705 maps contained in the atlas and to digitize the text of the whole AIS, in these successive stages:

  1. Stand-alone version https://www3.pd.istc.cnr.it/navigais (Ch. 5.1).
  2. Online version https://www3.pd.istc.cnr.it/navigais-web, also usable as an embedded object in an HTML, PowerPoint or WordPress document, as in Fig. 1a-1b (Ch. 5.4).
  3. OCR version, which concludes the project with the AIS database creation (Ch. 6).

Fig. 1a – NavigAIS embedded in the current document as a fully functional application (Ch. 5.4).

 

AMDV – Multimedia Atlas of Veneto Dialects (2009-2015) (Ch. 7), a framework for analyzing the diachronic evolution of the Italian dialects over 90 years in the Veneto region (north-eastern Italy), by means of a recording campaign in the same locations as the original AIS surveys, using the same questionnaire as at that time (http://www.pd.istc.cnr.it/amdv). NavigAIS provided real-time access to the AIS maps, both to check the speakers’ answers in the field and to build the database of the AMDV 2009-2010 lemmas and the related 1921 AIS lemmas.

AISr – AIS Reloaded (2016-2019) (Ch. 6), which aims to build an online searchable database of the AIS lemmas and to carry out a survey in the Swiss cantons of Ticino and Grisons, in order to study the changes that have occurred in the dialects over a century (http://www.rose.uzh.ch/de/forschung/forschungamrose/projekte/AIS-reloaded.html). To create the database, NavigAIS was equipped with a dedicated supervised OCR, so that the AIS text could be acquired in an acceptable time. Half of the work (maps 1-880 of a total of 1705) had been completed by the end of 2019, and about 700,000 entries in all are now accessible and downloadable at https://www.ais-reloaded.uzh.ch.

Fig. 2 – NavigAIS Online Version: https://www3.pd.istc.cnr.it/navigais-web?map=1325&point=570&zoom=1.65

1. Saving a Cultural Heritage

The paper deals with the methodology, the development and the state of the art of three interrelated projects, NavigAIS, AMDV and AIS Reloaded, and their compliance with the FAIR principles.

The projects share two common objectives.

The first is the declared intention to save and promote the cultural heritage of the Linguistic and Ethnographic Atlas of Italy and Southern Switzerland (AIS) (Sprach- und Sachatlas Italiens und der Südschweiz) by the Swiss linguists Karl Jaberg and Jakob Jud (Jaberg/Jud 1928-1940b).

It is a precious heritage because, according to Pier Paolo Pasolini (Stockholm interview, 30/10/1975), the fate of the dialects is sealed despite everything:

Fascism tried, throughout its twenty years in power, to destroy the dialects. It could not. Instead, the power of consumerism, which says it wants to preserve the dialects, is destroying them.

The second aim of the projects is to make the linguistic resources developed freely available online, in keeping with the concepts of open data access, or with the FAIR principles. FAIR is the recent acronym (Wilkinson et al. 2016, Lücke 2019, Krefeld/Lücke 2019) for Findability, Accessibility, Interoperability and Reusability, the characteristics that certain data (for example the results of publicly funded research, such as the AIS) must have in order to face the challenges of an interconnected world.

What do these concepts mean? Essentially, letting the machines do what they do best (and humans do worst): repetitive, analytical, probabilistic operations on immense amounts of data.

The need for such an approach in building linguistic resources became clear from the experience gained by the author in the creation of a linguistic atlas of the Trentino (northern Italy) dialects (Il Trentino dei Contadini) by the Museo degli Usi e Costumi della Gente Trentina (San Michele all’Adige, Trento) (Mott/Kezich/Tisato 2003, Fig. 3, 4, 5, Ch. 4).

The aim of the work was to investigate the diachronic evolution of the Trentino dialects by comparing the lexical materials acquired in 1921 by the AIS with those recorded in the same localities at the present time. Completing the atlas took 5 years for only 3200 words, mainly because the original AIS symbolism was replaced with a simplified phonetic set, invented to facilitate readability and implementation, which unfortunately proved to be an endless source of trouble and delay. A lot of time was also lost to quite trivial difficulties in consulting the AIS maps during the creation of the lemma database, and to the lack of automation in data acquisition and processing. A typical nightmare: retrieving a lemma, or a string, in the 8 volumes and 1705 pages of the AIS atlas could take an eternity compared with a digital query. This, incidentally, explains why the AIS index volume was published 20 years after the AIS itself (Jaberg/Jud 1960), and still without offering an adequate solution: it could only list some prototypical forms, e.g. “zisila” (swallow), but obviously not the possible variants (to give an idea with just a few examples from the Veneto region: [ϑižilα, sī́ligα, ṣẹẓíla, sįẓī́a, ṣiẓī́ɫa, sįzī́ẹ, ṣiẓíla, sizī́a, ϑī́ria, tsíriẹ, tsíɫa, ϑī́riga, sī́ligẹ, ϑī́liga]), which a digital database can instead provide, as the AIS Reloaded project intends to do.

The valuable experience gained in the Trentino project, together with the open source and open data access concepts that were beginning to spread in those years and were eventually formulated in 2016 under the name of FAIR principles, allowed the author to design the Multimedia Atlas of Veneto Dialects (AMDV) on completely new foundations (Tisato/Barbierato/Ferrieri/Gentili/Vigolo 2013) (Ch. 7). From the previous project, AMDV inherited the innovative idea of studying dialectal variation using the AIS atlas as the touchstone, but added an approach based on current geolinguistic methodologies. At that time, however, not everybody appreciated the meaning and importance that these concepts could also have in the linguistic field. A colleague of Alberto Zamboni (the great expert on the Veneto dialects, Zamboni 1974, and a member of the AMDV team), on learning of the intention to create a digital AIS atlas, did not believe it was a good idea:

Why ever embark on such a complex enterprise, when it is enough to enter the department library and consult the volumes without the least effort?

Without the least effort? Considering the weekends, the countless holidays in the Italian calendar, the time slots reserved for researchers or for students, the strikes and illnesses of the custodians, and the evening and night closures, there was little hope of getting into the library.

How to share the few hours available with the 3-4 people of the AMDV team, who needed to consult the atlas at the same time?

On the other hand, what to tell people who, at that time, had to travel hundreds of km to get to the library?

What's the problem? It is obviously possible to photocopy the atlas pages.

Obviously? It was unthinkable to plan the AIS acquisition on a normal A4, or even A3, scanner without compromising the integrity of the AIS volumes. Each scan meant turning the volume upside down with someone's help, orienting the page correctly, and repeating this step 4 times to cover the entire surface (44x58 cm), for each of the 1705 pages.

Thus the commitment to carry out the NavigAIS and AMDV projects, while avoiding what had happened in the Trentino adventure, forced us to give a turn to the linguistic karmic wheel.

2. Keywords

AIS atlas, linguistic geography, Italian dialects, Swiss dialects, Veneto dialects, NavigAIS, AMDV, AISr, AIS Reloaded, FAIR principles, interoperability, linguistic FAIRness.

3. Project Chronology

1998-2003 – Realization of Il Trentino dei Contadini, the (little) talking atlas of the Trentino dialects by the MUCGT (Museo degli Usi e Costumi della Gente Trentina) (San Michele all’Adige, Trento) (Mott/Kezich/Tisato 2003, Fig. 3, 4, 5, Ch. 4).

2009-2017 – Development of NavigAIS, a high-resolution digital version of the AIS atlas, created to allow the navigation of the 1705 maps contained in the atlas and to acquire the whole AIS text by means of a purpose-built OCR (Optical Character Recognition) (Ch. 5).

The NavigAIS project developed in three phases:

  • 2009 – A batch Matlab version was implemented for the AMDV project, which was then starting (Ch. 5.1) (info and download at https://www3.pd.istc.cnr.it/navigais) (Tisato 2010). NavigAIS had to guarantee the AMDV team real-time consultation in the field, so that the speaker’s answers could be checked immediately (Fig. 17, 18, 19), and to support the subsequent creation of the AIS lemma database.
  • 2014 – The NavigAIS online version was created to give all researchers and students easy and quick access to the AIS atlas in accordance with the FAIR principles (https://www3.pd.istc.cnr.it/navigais-web) (Fig. 2, 20, 21) (Ch. 5.3).
  • 2015-2017 – The final NavigAIS version integrates an OCR, specifically developed for the AIS Reloaded project, to complete the digitization of the AIS text in an acceptable time (Tisato 2019) (Ch. 6). Half of the work (i.e. 880 of the 1705 maps, with about 700,000 entries in all) had been completed by the end of 2019, and the related database is now accessible and downloadable (https://www.ais-reloaded.uzh.ch).

2009-2015 – AMDV – Multimedia Atlas of Veneto Dialects (Tisato/Barbierato/Ferrieri/Gentili/Vigolo 2013, Tisato/Vigolo 2016) (Ch. 7). The project was carried out at the Institute of Cognitive Sciences and Technologies (ISTC-CNR) and the Department of Linguistic Disciplines, Padua University, by an interdisciplinary team¹, and funded by the Padua Cariparo Foundation. It offers a multimedia, ethnographic and etymological framework for investigating the diachronic evolution of the Italian dialects over 90 years in the Veneto region, through a recording campaign in the same locations as the AIS surveys and using the same AIS questionnaire (http://www.pd.istc.cnr.it/amdv).

The paper discusses the approach adopted for the acquisition, coding and indexing of the AMDV documents and for information retrieval and delivery, as well as the software tools developed to extract phonetic and spectral parameters from the collected speech materials and to code them consistently with the FAIR principles (Ch. 7, Ch. 8).

2016-2019 – AISr – AIS Reloaded (Loporcaro/Schmid/Pescarini/Tisato/Donzelli/Negrinelli/Zanini 2019). The project was carried out at the Romanisches Seminar (RoSe), Zurich University, and funded by the Swiss National Science Foundation (Ch. 6).

It intends to:

  • Build an online searchable database of all the AIS lemmas (at the end of 2019, 880 of the 1705 maps had been acquired, with about 700,000 entries). In accordance with the FAIR principles, the data are already available on the AISr site: https://www.ais-reloaded.uzh.ch.
  • Carry out a survey at 36 AIS points in the cantons of Ticino and Grisons, collecting data in the same AIS localities and using the same questionnaire, to study the lexical evolution of the dialects over one hundred years (http://www.rose.uzh.ch/de/forschung/forschungamrose/projekte/AIS-reloaded.html).

Fig. 3 – Il Trentino dei Contadini main window shows the simplified transcription used in the atlas: for ex. in Roncone [sigruSèl, manårìn] instead of original AIS [ṣį̀gruẓę́l, manαrī́n] (https://www3.pd.istc.cnr.it/navigais-web/?map=547&loc=roncone).


4. Lesson to Learn from the Trentino Atlas

A talking atlas of the Trentino dialects was created by Antonella Mott, Giovanni Kezich, and Graziano Tisato at the Museo degli Usi e Costumi della Gente Trentina (MUCGT) (San Michele all’Adige, Trento) (Mott/Kezich/Tisato 2003, Fig. 3, 4, 5) under the name Il Trentino dei Contadini, or, with its more explicit full title, Il Trentino dei Contadini - Piccolo atlante sonoro della cultura materiale : le parole e le cose della ricerca di Paul Scheuermeier (1921-1931) e le voci della tradizione di oggi (1998) (The Trentino of the Peasants - Small talking atlas of material culture: the words and the things of Paul Scheuermeier's research (1921-1931) and the voices of today's tradition (1998)).

The MUCGT museum, which can boast many editorial projects of very high quality, also had the merit of rediscovering Scheuermeier's work from an ethnographic point of view and of producing a (small) talking atlas, funded by the Cassa di Risparmio di Trento e Rovereto Foundation (Fig. 3).

The atlas allowed an exploration of the peasant world with some innovations that were interesting at the time, in particular the talking photos (Fig. 4) and an interactive sonogram for studying the phonetic features of the Trentino dialects (Fig. 5).

Fig. 4 – Talking photo in Il Trentino dei Contadini.
Scheuermeier picture at Peio municipal dairy, 17 June 1921.
Moving the mouse over the objects reveals the object label, and allows listening to the dialectal speech.

The author was involved in the project in 1998, after the museum had lost two years in fruitless attempts to carry it out. The completion took 5 years (1998-2003), for the following reasons:

  1. The difficulty of continuously handling and consulting the AIS volumes, which have the unusual size of 44x58 cm, about an A2 sheet.
  2. The delays in assigning a lemma to the desired location, and in retrieving the lemmas that the AIS compilers had been forced to move to the legend, or elsewhere on the page, because of the limited space available on the map (https://www3.pd.istc.cnr.it/navigais-web/?map=1569&zoom=8&point=6400,12700,440,140).
  3. The fact that the interviews were conducted by a speaker of Italian only, so that many times, despite all recommendations, the informant felt obliged to respond courteously in Italian.
  4. The fact that the images to be commented on were shown to the informant on a series of loose sheets (because of the need to randomize the questionnaire), with the consequence that a lot of time was wasted looking for the right image, and that random noises and rustling were recorded.
  5. The delays in audio post-processing, due to the difficulty of finding the word to be processed, or re-examined, in 7-8 hours of continuous digital recording.
  6. The use of a simplified symbolism instead of the original AIS one (Fig. 23). The decision was taken for reasons of legibility, and because of the difficulty of transcribing the very complex AIS phonetic inventory, with 71 vowels, 172 consonants, 40 diacritical symbols distributed on 7 different levels (Fig. 22), and 20 punctuation symbols (https://www3.pd.istc.cnr.it/navigais-web/AIS_symbols.htm); it ended up creating many more problems than it was meant to solve.

In fact, while the first 5 points merely delayed the completion of the atlas, the invention of a “simple” phonetic system proved to be a linguistic headache: whenever the replacement of an original AIS symbol was doubtful, the transcriber had to consult an expert, and neither of them, lacking the original sound source, could be certain that their decision was correct.

See an example in Fig. 3, the Roncone transcription [sigruSèl, manårìn] instead of original text [ṣį̀gruẓę́l, manαrī́n] (https://www3.pd.istc.cnr.it/navigais-web/?map=547&loc=roncone).

Fig. 5 – Il Trentino dei Contadini
Waveform, sonogram, pitch (red) and intensity (blue) for a dialectal lemma of Mortaso.

5. NavigAIS - The AIS Navigator

5.1. NavigAIS Stand Alone Version

Having learned the Trentino lesson the hard way, at the start of the AMDV project (Ch. 7) we decided to build a high-resolution navigable version of the entire AIS atlas, although this was not initially planned in the AMDV project, which focused only on the dialects of the Veneto region. We chose, however, not to limit its implementation to the Veneto, or to the maps strictly required by our project, but to extend it to all the Italian regions and all the 1705 AIS maps, aware that many people needed the same resource.

The NavigAIS download became available in late 2009 (https://www3.pd.istc.cnr.it/navigais) (Fig. 17, 18, 19).

To create the digital atlas, some weeks were spent searching for the right scanner, asking publishers, printers and photographers, until, by a lucky chance, a Zeutschel OS 10000 color scanner was found in the municipal archives of Padua; it could work at 600 dpi, supported the A1 format and provided a book cradle (Fig. 6). This kind of scanner can acquire a double A2 page, with the book placed on 2 balanced plates and the book spine in the gap between the two. The machine presses the two pages uniformly against the scan glass surface, thus avoiding any deformation.

Fig. 6 – The AIS maps were digitized with a Zeutschel OS 10000 scanner at 600 dpi resolution.
The resulting map size was about 10000x13000 px (420 MB/map - 716 GB total).

To complete the task in an acceptable time (before the start of the AMDV project), the entire processing chain was automated so as to require no, or minimal, human supervision (Ch. 5).

The 1705 acquired maps were subjected to a preliminary 6-step processing, whose steps could be executed and inspected one at a time, or carried out all together (Fig. 7) (Ch. 5.1). The processing was needed to improve the visual rendering of the noisy original maps (rotating the maps, enhancing the contrast, eliminating the background noise, etc.) (Fig. 16, before and after the processing), and to automatically separate the text (black) from the background (orange) into two different layers, with the aim of ensuring the best results in the final automatic word recognition step (Ch. 6). Without this separation, in innumerable cases, the borders and orographic lines would prevent the correct recognition of the sequence (e.g. map 548, points n. 71, 115, 116, 118, etc., Fig. 27, NavigAIS-web K. 548.71).

Fig. 7 - NavigAIS - The 6 steps of the AIS map processing.

 

Step 1 - Image Rotation

The first step of the AIS map processing must correct the page rotation that is inevitable in the scanning process. This is important for the map visualization, and indispensable for the subsequent text recognition task. To do this in an optimal way, the program exploits the orange borders present in all the AIS pages (except the prefaces). First of all, the Matlab procedure separates, on the basis of the orange color, the background, the border lines, the location identifiers and the rectangular frame from the text. Then we extract the image edges using the Roberts method of approximation to the derivative². The edges are located at the points where the gradient of the input matrix is maximum. The rotation angles of the frame sides can then be computed with a Radon transform, which has the remarkable capacity to extract lines and curves from very noisy images. The Radon transform works by projecting (i.e. summing up) the image intensity along a line whose inclination angle varies in a specific range (in our case between -2° and +2°, in increments of 0.02°).

The original image is then rotated by each of the 4 candidate angles (one per side of the frame), and the Radon transform is recalculated to find the angle giving the best result. In this way we try to eliminate the errors induced by the deformation at the book spine, which could alter some of the side lines but less frequently all of them, and by the interference of the contiguous page, which is sometimes captured in the current frame.
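The angle search described above can be illustrated with a minimal sketch in Python (not the original Matlab code). It assumes that the edge pixels of one frame side are already available as (x, y) coordinates, and it replaces the full Radon transform with a simplified discrete projection: for each candidate angle, the pixels are projected perpendicularly to the line direction and the sharpest peak is kept. The function name `estimate_skew` and the synthetic test line are hypothetical; only the angle range and the increment come from the text.

```python
import math
from collections import Counter

def estimate_skew(points, lo=-2.0, hi=2.0, step=0.02):
    """Return the angle (in degrees) whose projection shows the sharpest peak."""
    best_angle, best_peak = 0.0, -1
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        angle = round(lo + i * step, 2)
        t = math.radians(angle)
        s, c = math.sin(t), math.cos(t)
        # Signed distance of every edge pixel from a line through the
        # origin inclined by `angle`; pixels of a line with that same
        # inclination all fall into (almost) the same bin.
        bins = Counter(round(y * c - x * s) for x, y in points)
        peak = max(bins.values())
        if peak > best_peak:
            best_angle, best_peak = angle, peak
    return best_angle

# Synthetic frame side: a 2000 px long line tilted by 0.8 degrees (assumed data).
true_angle = 0.8
slope = math.tan(math.radians(true_angle))
edge_pixels = [(x, round(100 + x * slope)) for x in range(2000)]
estimated = estimate_skew(edge_pixels)
```

On real scans the same search would be run on each side of the frame, which is where the 4 candidate angles mentioned in the text would come from.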

Fig. 8 – NavigAIS Initial Scan Step - Fig. 9 – Step 1: Map rotation (NavigAIS-web K. 225).

Step 2 - Image Cropping

The procedure automatically crops the image to the minimum possible size, with enough intelligence to avoid including part of the contiguous page (Fig. 10). Moreover, all the maps are aligned in the same way, so that a future version of the navigator can use a single background for all the different maps.

Step 3 - Contrast Enhancement

We then need to prepare the map for the next step by adjusting the image contrast. In this case the intensity values of the rotated image are weighted toward lower values to produce darker colors (Fig. 11).
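One common way of weighting intensities toward lower values is a gamma curve with an exponent greater than 1. The following Python sketch shows the idea on a tiny grayscale matrix; that the original Matlab procedure used exactly this mapping is an assumption, and the function name `darken` and the value gamma=2.0 are purely illustrative.

```python
def darken(pixels, gamma=2.0):
    """Map 8-bit intensities through out = 255 * (in / 255) ** gamma."""
    # A lookup table avoids recomputing the power for every pixel.
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[v] for v in row] for row in pixels]

sample = [[0, 64, 128], [192, 230, 255]]   # hypothetical gray levels
darker = darken(sample)
```

Black and white are left unchanged, while every intermediate level is pushed toward black, which increases the separation between the dark text and the lighter background.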

Fig. 10 – Processing Step 2: Map cropping – Fig. 11 – Step 3: Map contrast enhancement (NavigAIS-web K. 225).

 

Step 4 - Separating the Foreground and Background Image Components

In this stage the orange background is isolated (Fig. 13) and then subtracted from the whole image to obtain the text separation (Fig. 12).
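A minimal Python sketch of such a color-based separation is given below. It classifies each RGB pixel as orange background, dark text, or neither, and returns two 0/1 masks. The thresholds, the function name `split_layers` and the sample colors are assumptions chosen for illustration, not values taken from NavigAIS.

```python
def split_layers(rgb_rows, dark_threshold=100, orange_margin=40):
    """Return (text_mask, background_mask) as matrices of 0 and 1."""
    text_mask, background_mask = [], []
    for row in rgb_rows:
        text_row, background_row = [], []
        for r, g, b in row:
            # Orange: red clearly dominates blue, with green in between.
            is_orange = r > g > b and (r - b) > orange_margin
            # Text: a dark pixel that does not belong to the orange layer.
            is_text = (not is_orange) and max(r, g, b) < dark_threshold
            background_row.append(1 if is_orange else 0)
            text_row.append(1 if is_text else 0)
        text_mask.append(text_row)
        background_mask.append(background_row)
    return text_mask, background_mask

# Hypothetical sample colors: paper, orange border, black ink.
PAPER, ORANGE, INK = (240, 235, 220), (230, 140, 60), (30, 30, 30)
scan = [[PAPER, ORANGE, INK], [INK, PAPER, ORANGE]]
text, background = split_layers(scan)
```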

Fig. 12 – NavigAIS Processing Step 4: Text separation. – Fig. 13 – Background separation (NavigAIS-web K. 225).

Step 5 - Filtering the Noise

We run a median filter on the resulting image, i.e. a nonlinear transformation used to reduce so-called “salt and pepper” noise. The advantage of a median filter is that it is more effective than other algorithms (e.g. convolution) when the aim is to reduce the noise and at the same time preserve the edges (Fig. 14). We repeat the same process on the output image to obtain the final foreground component (Fig. 12).
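The behavior of the median filter can be seen in a small Python sketch. It assumes a 3x3 window and leaves the border pixels untouched; both choices, like the function name `median3x3`, are assumptions, since the window size used in NavigAIS is not stated.

```python
def median3x3(img):
    """Apply a 3x3 median filter to a matrix; border pixels are copied as-is."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sorted(window)[4]   # middle of the 9 sorted values
    return out

# A vertical edge (0 on the left, 1 on the right) with one isolated noise pixel.
noisy = [[0, 0, 0, 1, 1] for _ in range(5)]
noisy[2][1] = 1
filtered = median3x3(noisy)
```

In this example the isolated pixel disappears while the vertical edge stays exactly where it was, which is the property the text refers to: an averaging convolution would instead blur the edge.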

Step 6 - Saving the Foreground and Background Map Components

The two definitive matrices, containing the image text and the background, are logical masks made only of 0s and 1s. For historical reasons, Matlab stores a logical variable in 8 bits instead of just one, wasting a lot of space. It was therefore necessary to write a low-level compression routine to force efficient storage of this kind of data. With this expedient, the entire AIS shrank to 2.72 GB (roughly 260 times smaller than the 716 GB of the scanned images). The background layer uses its bit as a mask to color the background red (Fig. 15).
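The basic idea of storing one bit per pixel instead of one byte can be sketched in Python as follows. The functions `pack_mask` and `unpack_mask` are hypothetical and only illustrate bit packing; the paper does not describe the actual compression routine, which evidently achieves more than the eightfold saving that plain packing gives.

```python
def pack_mask(rows):
    """Pack a 0/1 matrix into bytes, 8 pixels per byte, row by row."""
    h, w = len(rows), len(rows[0])
    packed = bytearray((h * w + 7) // 8)
    for i, v in enumerate(v for row in rows for v in row):
        if v:
            packed[i >> 3] |= 0x80 >> (i & 7)   # most significant bit first
    return h, w, bytes(packed)

def unpack_mask(h, w, packed):
    """Rebuild the 0/1 matrix from its packed representation."""
    def bit(i):
        return (packed[i >> 3] >> (7 - (i & 7))) & 1
    return [[bit(y * w + x) for x in range(w)] for y in range(h)]

mask = [[1, 0, 0, 1, 1], [0, 0, 1, 0, 1], [1, 1, 1, 0, 0]]
h, w, data = pack_mask(mask)
restored = unpack_mask(h, w, data)
```

Here the 15 pixels of the sample mask occupy 2 bytes instead of 15, and the original matrix is recovered without loss.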

Fig. 14 – Step 5: Salt and Pepper filtering. – Fig. 15 - Red colored background - Reunification of the 2 levels (NavigAIS-web K. 225).

Fig. 16 – NavigAIS Processing: Original noisy map (left), final processed map (right) (NavigAIS-web K. 547).

NavigAIS Navigation Software

A Matlab graphical user interface, called NavigAIS, was created to allow an easy navigation of the maps.

NavigAIS presents 3 windows (Fig. 18). To move around the AIS map, it provides an overview window, which is a miniature of the entire map (left box, Fig. 18): a blue rectangle marks the current position and can be dragged to display the desired zone in the main window. The main window displays the AIS maps at the desired magnification. The top of both these windows presents a toolbar with buttons that allow the user a) to zoom in and out of the map, b) to move from one locality to another in sequential or predefined order, and c) to print and save the current image. This last function was exploited in the AMDV project to acquire an optimal reproduction of all the Boesch drawings at 600 dpi, later used in the AMDV atlas (Fig. 2, 20, 42).

The third window offers a string search on the AIS map index and on the location names (Fig. 19). It is possible to preselect the desired locations and explore them sequentially. This function was created to automate the planned final OCR phase, which required moving successively through the survey points of a user-scheduled list.

5.2. NavigAIS Role in the AMDV Project

The NavigAIS batch version played an essential role in the realization of the AMDV project, in two fundamental steps of the work:

  • First of all, from July 2009, NavigAIS was used for real-time checking of the informants’ answers in the field (Fig. 17, 18). With the NavigAIS feedback, the linguist could instantly verify similarities and differences with the lemmas transcribed in the AIS maps 90 years earlier, in the same and nearby localities, and interact during the recording sessions (Fig. 17). This kind of facility, never used before in field investigations, gave the AMDV researchers a great advantage in terms of both lexical precision and number of recorded lemmas (14500 for 430 maps in the Veneto region), compared with the results collected by Scheuermeier (11600 words for the same 430 maps).

 

  • The second step in which NavigAIS was used concerned the creation of the AIS lemma database. In this case, the task was to explore the 26 points of the Veneto region, sequentially or in a desired order, and to transcribe the words into the database. This could be done simply by selecting the Veneto region in the appropriate list box (box 12, Fig. 19) and then moving reliably to the previous or next point. After a word was inserted in the database, NavigAIS displayed the saved word and the original graphic image side by side, to allow the user to check its correctness (bottom box, Fig. 18). With this procedure, the AIS database was completed in only two months, with an error rate smaller than 2%.

Fig. 17 – NavigAIS Batch Version (2009)
AMDV recording session, 15 October 2009, Romano d'Ezzelino. On the left, the linguist Alberto Zamboni compares the speaker's answers with those transcribed in the AIS atlas (NavigAIS-web K. 547.354). The SyncRec software shows the informant (Angelo Dissegna) an image to comment on, on the screen in front of him. The audio manager (out of frame) supervises the speech acquisition on his workstation (bottom left corner).

Fig. 18 – NavigAIS Batch Version (2009) shows the AIS map n. 225 "la pialla" (the planer).
Navigation window (left), search window (right), AIS digital map (background) (NavigAIS-web K. 225.35).
The AIS localities are identified with the related names to facilitate the work.
During the database check, the bottom box presents the acquired lemmas to compare with the original AIS text directly on the map.

 

Fig. 19 – NavigAIS search window, with the map (arrows n. 1, 2, 4, 16) and locality (9, 11, 13) search boxes,
the point selection by province and region (12), the buttons to move to the previous or next map (3) and to the previous or next point (5), to show the inquiry place names (7) and the background (6), and to display the OCR window (14), the current map (15), and the window with the configuration options (17).

5.3. NavigAIS Online Version

In 2014, an online version of NavigAIS was created to meet the many requests from people who did not use the Windows operating system, or had difficulty installing the stand-alone version on a virtual machine (Fig. 2, 20, 21).

The software is written in pure JavaScript and Microsoft Ajax (Asynchronous JavaScript And XML, a dominant technology for Web services), to simplify the web software implementation. In this way the data exchange with the server takes place asynchronously, without interfering with the client-side processing, and only the part of the map that is being displayed needs to be refreshed. The original maps were resampled at 15 levels of resolution, and the resulting images were then divided into a mosaic of 256x256 px tiles, each overlapping its neighbors by 1 px, for a total of 2622 images per map and 4,470,510 (1705*2622) files for the entire atlas.
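The order of magnitude of these figures can be checked with a short Python sketch. It assumes a pyramid in which each level halves the resolution of the next one down to a single pixel, as in Deep Zoom style viewers, and it uses the nominal scan size of 10000x13000 px; the paper does not spell out the tiling scheme, so both the scheme and the function name `pyramid_tiles` are assumptions.

```python
import math

def pyramid_tiles(width, height, tile=256):
    """Return (number of levels, total number of tiles) of a halving pyramid."""
    max_level = math.ceil(math.log2(max(width, height)))
    total = 0
    for level in range(max_level + 1):
        scale = 2 ** (max_level - level)
        w = math.ceil(width / scale)      # image size at this level
        h = math.ceil(height / scale)
        total += math.ceil(w / tile) * math.ceil(h / tile)
    return max_level + 1, total

levels, tiles = pyramid_tiles(10000, 13000)
atlas_files = 1705 * 2622                 # figures reported in the text
```

Under these assumptions the nominal size gives 15 levels and 2750 tiles per map. The level count agrees with the text; the slightly lower figure of 2622 images per map may reflect the smaller size of the cropped maps, which is not reported.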

Fig. 20 – NavigAIS Online Version
Direct access to Paul Boesch drawings on the AIS map n. 1329
https://www3.pd.istc.cnr.it/navigais-web?map=1329&point=3400,6600&zoom=2.5

The NavigAIS online architecture has the advantage of avoiding the software installation procedure and, more importantly, of giving worldwide access to the AIS, making NavigAIS compliant with the FAIR principles.

With regard to the language field, syntactic interoperability is the capacity to process the exchanged information with simple or no conversion. Semantic interoperability refers to the ability to interpret exchanged linguistic data in a coherent way, according to a common protocol. NavigAIS implements an elementary metadata structure that leaves no ambiguity of interpretation: map name and map number identifier, inquiry locality name and inquiry number identifier, point coordinates [x, y], rectangle coordinates [x, y, width, height], zoom factor.
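The metadata fields listed above can be pictured as a simple record. The following Python sketch is only an illustration of that structure: the class name `MapView` and the field names are hypothetical, while the example values are taken from the URLs quoted elsewhere in this paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class MapView:
    """One unambiguous reference to a place on an AIS map (hypothetical names)."""
    map_number: int                                        # 1-1705
    map_name: Optional[str] = None
    locality_number: Optional[int] = None                  # inquiry point identifier
    locality_name: Optional[str] = None
    point: Optional[Tuple[int, int]] = None                # [x, y] in px
    rectangle: Optional[Tuple[int, int, int, int]] = None  # [x, y, width, height] in px
    zoom: Optional[float] = None

# A word moved off the map on map 1569, framed by a rectangle.
off_map_word = MapView(map_number=1569, rectangle=(6400, 12700, 440, 140), zoom=6)
# An inquiry point addressed by its identifier and name.
ponte = MapView(map_number=1434, locality_number=336, locality_name="Ponte nelle Alpi")
```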

The database containing the map subjects (6900 items), the numeric labels of the 1705 AIS maps, and the database of the 407 AIS inquiry locality names with the related numeric labels are all integrated in a JavaScript file of 370 KB, which is downloaded and cached on the client side. The total size, adding the HTML document and the CSS style file, is smaller than 400 KB.

Fig. 21 – NavigAIS Online Version
Direct access to a conjugation (remote past tense of the verb “essere”, to be)
https://www3.pd.istc.cnr.it/navigais-web?map=1700&point=3500,6900&zoom=3

The queries to the software are built simply by appending string parameters (or URL variables) to the usual web address, i.e. the URL (Uniform Resource Locator), as follows:

5.3.1. Optional Parameters

1 - map=[xxxx] (xxxx n. of the AIS map [1-1705]), for ex.:

http://www3.pd.istc.cnr.it/navigais-web?map=1401
(Open the map n. 1401 at the first inquiry location, Brigels)

2a - point=[yyy] (yyy is the ID point [0=Legend, 1-990]), for ex.:

http://www3.pd.istc.cnr.it/navigais-web?map=1434&point=336
(Open the map n. 1434 at the locality n. 336 - Ponte nelle Alpi)

2b - point=[x,y] (x,y absolute coordinates of the point in px in the range x=[1-10000], y=[1-13000]), for ex.:

http://www3.pd.istc.cnr.it/navigais-web?map=1329&point=900,6000
(Open the map n. 1329 centered on the point with coordinates x=900, y=6000)

2c - point=[x,y,w,h]
x,y absolute coordinates of the rectangle's left corner in px, x=[1-10000], y=[1-13000];
w width and h height of the rectangle in px, w=[1-10000], h=[1-13000], for ex.:

http://www3.pd.istc.cnr.it/navigais-web?map=1569&zoom=6&point=6400,12700,440,140
(Open the map n. 1569 centered on the point x=6400, y=12700, and draw a blue rectangle around a word moved off the map)

3a - loc=[location name] (location name is the inquiry location name).

http://www3.pd.istc.cnr.it/navigais-web?map=547&loc=Teolo
(Open the map n. 547 centered on Teolo)

3b - loc=[cardinal direction] (nw/northwest, nh/north, ne/northeast, we/west, ce/center, et/east, sw/southwest, sh/south, se/southeast).

http://www3.pd.istc.cnr.it/navigais-web?map=547&loc=nw
(Open the map n. 547 centered on the northwest quadrant)

4 - zoom=[zoom factor] (0.1-40).

http://www3.pd.istc.cnr.it/navigais-web?map=1329&point=2000,6000&zoom=10
(Open the map n. 1329 on the point with coordinates x=2000, y=6000, and a zoom factor of 10x)
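
As an illustration of the parameters above, the following Python sketch assembles a query URL. The function name `navigais_url` and its argument names are hypothetical helpers, not part of NavigAIS; only the parameter names (map, point, loc, zoom), the ranges for map and zoom, and the base address are taken from the text. Commas are left unescaped, as in the examples of this chapter.

```python
from urllib.parse import quote

# Base address as used in the examples of this chapter.
BASE_URL = "https://www3.pd.istc.cnr.it/navigais-web"


def navigais_url(map_no=None, point=None, loc=None, zoom=None, base=BASE_URL):
    """Build a NavigAIS query URL (hypothetical helper, for illustration only)."""
    params = []
    if map_no is not None:
        if not 1 <= map_no <= 1705:
            raise ValueError("map must be in the range 1-1705")
        params.append(f"map={map_no}")
    if point is not None:
        # A locality ID (int), a pair (x, y), or a rectangle (x, y, w, h).
        values = (point,) if isinstance(point, int) else tuple(point)
        if len(values) not in (1, 2, 4):
            raise ValueError("point must have 1, 2 or 4 values")
        params.append("point=" + ",".join(str(v) for v in values))
    if loc is not None:
        # Locality name or cardinal direction code (nw, ne, ce, ...).
        params.append("loc=" + quote(loc))
    if zoom is not None:
        if not 0.1 <= zoom <= 40:
            raise ValueError("zoom must be in the range 0.1-40")
        params.append(f"zoom={zoom}")
    return base + ("?" + "&".join(params) if params else "")


print(navigais_url(1329, point=(2000, 6000), zoom=10))
print(navigais_url(547, loc="Teolo"))
```

The first call reproduces the last example of the list (map n. 1329, point 2000,6000, zoom 10x), the second one the access by locality name.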

5.4. How to Embed NavigAIS into a User Document

To meet the FAIR reusability requirement, NavigAIS allows the entire AIS atlas to be embedded, as an object, in HTML documents (WordPress, Visme, Google Slides, etc.) with a single line of code obtained from the NavigAIS command bar ("Embed Code" button in Fig. 1a, 1b), or copied from Ch. 5.4.1. The code contains the URL string to access the required AIS map at a given locality or position and with a given zoom factor (https://www3.pd.istc.cnr.it/navigais-web/navigais_embed.htm).

Fig. 1b – NavigAIS, embedded in the current WordPress document

5.4.1 - Copy and paste the following code in your HTML page:

 

5.4.2 - Optional Parameters

● The NavigAIS container dimensions [width:"1000" height:"500"] can be modified as needed.

● The call to NavigAIS accepts the 4 optional parameters (map, point, loc, zoom) described in Ch. 5.3.1.

 

Fig. 22 – Levels of the AIS diacritics with respect to the basic glyph (magenta).

6. AIS Reloaded Project

In 2016, Michele Loporcaro and Stephan Schmid of the Romanisches Seminar (Zurich University) accepted the proposal to acquire the text of the entire AIS and to create a searchable and downloadable database. The project, named AIS Reloaded (AISr) and planned to acquire half of the entire atlas within three years (2016-2019), was funded by the Swiss National Science Foundation with 684,000 CHF. The expected result was achieved on schedule at the end of 2019 (https://www.ais-reloaded.uzh.ch), and additional funding has been obtained for the planned completion of the work (2021-2024).

The AISr project also aimed to create a comparable corpus for the dialects of Southern Switzerland, collecting new data one hundred years later in the original AIS locations, and using the same questionnaire. These data are indispensable for documenting the diachronic change of the corresponding dialects over the elapsed century (http://www.rose.uzh.ch/de/forschung/forschungamrose/projekte/AIS-reloaded.html).

Fig. 23 – AIS Phonetic Inventory https://www3.pd.istc.cnr.it/navigais-web/AIS_symbols.htm

6.1. How to OCR the AIS

How is it possible to extract the text from an image (represented digitally as a numeric matrix of pixels, i.e. picture elements, the smallest visible tiles of the original mosaic)?

One possible approach is pattern matching. In the first phase, the OCR training creates the templates of all the characters to be recognized. In the second phase, an unknown input character is identified, according to certain criteria, as the most similar to one of the templates.

In the elementary example of Fig. 24, the A, B, C red colored characters on the right represent the created prototypes. The input character to be recognized appears on the left.

Each square represents a pixel of the image, with its color number identifier (in this case, for convenience, 0=blue=background, 1=red=text).

Then, applying a very basic pixel matching rule, the input character is compared with the templates by simply counting the number of pixels that do not coincide (indicated with "x" in Fig. 24): the template that gives the lowest number of non-coincidences is the most likely candidate for the character sought.

With this rule, we obtain 17 non-coincident pixels for the first template (A) and 6 for the third template (C), while all the pixels of the second template (B) coincide, which allows the input to be identified as B (green oval in Fig. 24).
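
The pixel matching rule can be sketched in a few lines of Python. The 3x5 bitmaps below are invented for illustration and are much smaller than the glyphs of Fig. 24, so the mismatch counts differ from those quoted in the text; the names `TEMPLATES`, `mismatches` and `recognize` are likewise hypothetical.

```python
# Toy 3x5 templates (1 = text pixel, 0 = background), invented for this sketch.
TEMPLATES = {
    "A": ["010", "101", "111", "101", "101"],
    "B": ["110", "101", "110", "101", "110"],
    "C": ["011", "100", "100", "100", "011"],
}


def mismatches(glyph, template):
    """Count the pixels that differ between two bitmaps of equal size."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph, templates=TEMPLATES):
    """Return the best matching label and the mismatch count per template."""
    scores = {label: mismatches(glyph, tpl) for label, tpl in templates.items()}
    best = min(scores, key=scores.get)
    return best, scores


unknown = ["110", "101", "110", "101", "110"]
label, scores = recognize(unknown)
print(label, scores)
```

As in the figure, the input is attributed to the template with the fewest non-coincident pixels; any change in size, slant or stroke thickness of the input would immediately raise all the counts, which is the weakness discussed next.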

Such a recognition system is obviously of interest only for characters of fixed shape and size. For ancient documents and handwritten characters, we need an algorithmic description of the glyph shape (for ex. a polygonal approximation, blue oriented lines in Fig. 25) that does not depend on the inclination, size, shape and degradation of the characters (for ex. the ruined [o] in Fig. 25).

In this way, a feature vector is extracted, which makes it possible to measure the distance, in an N-dimensional space, between the unknown character and the training prototypes (the black [o] in Fig. 25).
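
A minimal sketch of this idea, assuming a Euclidean distance and a single prototype vector per character (the simplest nearest-prototype classifier): the two-dimensional feature values and the function name `nearest_prototype` are invented for illustration, and real feature vectors have many more dimensions.

```python
import math


def nearest_prototype(features, prototypes):
    """Return (label, distance) of the prototype closest to the feature vector."""
    label, vector = min(
        prototypes.items(), key=lambda item: math.dist(features, item[1])
    )
    return label, math.dist(features, vector)


# Hypothetical 2-D feature vectors of two training prototypes.
prototypes = {"o": (1.0, 0.0), "a": (0.0, 1.0)}

# A degraded glyph whose features are still closer to the [o] prototype.
label, distance = nearest_prototype((0.9, 0.2), prototypes)
print(label, round(distance, 3))
```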

The design of efficient algorithms to extract the feature vector and the development of innovative methodologies and technologies now involve a large number of research areas, such as ICR (Intelligent Character Recognition), Machine Learning, Pattern Recognition, Deep Learning, etc.

Without going into details, which can be found in the specific literature, we just mention the state of the art by giving the results of a test (obtained just before the start of the AISr project by Vithlani and Kumbharana, 2015³) on the recognition of a sequence of 36 characters (English, alphanumeric, capital, isolated), repeated 10 times and handwritten by 7 people of various ages. The sequence was processed by 6 stand-alone and online OCR programs. The Custom OCR Online gave the best results, with a recognition accuracy of about 44%, i.e. a CER (Character Error Rate) of 56%, very far from the accuracy close to 100% obtained with printed characters (provided, obviously, that they are not degraded or affected by noise).

In reality, the results of common OCRs on the AIS atlas are much worse than those of the described test.

In fact, the AIS phonetic inventory contains a very large number of low-frequency symbols (Fig. 23), with a distribution that can be described by Zipf's law (Zipf, 1949). The inevitable consequence of this distribution is a worse CER for any OCR based on a probabilistic engine.

This problem also hinders the neural network approach, because of the enormous time required by the OCR training phase, which must be based on a representative number of real samples extracted from the whole atlas. In the case of the AIS, the number of symbols to be recognized is very high (~300), because they correspond to the languages of many regions and to great linguistic variability. Taking even a single sample of each of the 300 symbols from each of the 8 AIS volumes, the minimum representative set would consist of 2400 images.

The AIS symbolism has other features that make normal OCRs inefficient: the glyphs are in italics with a 70° inclination, overlap the neighboring characters (Fig. 33), are traced by hand by different people, and use 25 diacritics distributed on 7 different levels (Fig. 22).

Fig. 24 - Rudimentary OCR engine based on pixel matching between the prototypes and a submitted character. All the red pixels of the B template (green oval) coincide with the processed character (first column).

Moreover it must be added that:

  • The OCR accuracy depends on the quality of the scanned maps, which are almost a century old and not in good condition.
  • There is no language model, which could give the probability of a character string (for ex. bi-grams, tri-grams, n-grams) and of a word sequence, helping the OCR to select the right sequence.
  • There is no dictionary, which could list all the existing words of the dialect(s) and help select the correct form among various candidates.

The task also requires long execution times, because of the number of entries to be acquired (about 1705 maps x 407 investigation points), the training and implementation of the OCR, and the validation of the final result.

Fig. 25 - OCR Scheme. The classifier finds the distance in feature space to the ideal prototype vector (for ex. by kNN - k-Nearest Neighbors), or divides the feature space into regions related to different classes (for ex. by SVM - Support Vector Machine).

Before implementing NavigAIS, three OCRs were considered: the two most popular commercial programs at the time, Omnipage (Nuance, www.nuance.com) and FineReader (ABBYY, www.abbyy.com), and a free program, Tesseract (developed by Hewlett Packard between 1985 and 1994, released as open source in 2005 under the Apache 2.0 license, sponsored by Google for about ten years, and now downloadable from the repository: https://github.com/tesseract-ocr).

A first examination revealed that Omnipage did not ensure adequate management of the diacritics required by the AIS symbols, and it was discarded. FineReader allowed the coding of AIS characters and diacritics according to the Unicode standard. As for the training phase, the software accepted the graphic images of anomalous characters with the related identity, although the mechanism was very laborious: for the result to be acceptable, the identification operation had to be repeated on thousands of glyphs. This processing was also fairly approximate, as the images captured by FineReader could not be cleaned of the lines belonging to the neighboring characters, which worsened the CER.

However, the decisive inadequacy of FineReader was the impossibility of integrating the software into the overall processing. The only possible automation consisted in depositing the desired graphic image in a so-called Hot Folder, from which the program subsequently took and recognized it. Unfortunately, the action of the OCR, designed for normal Office Automation functions, was asynchronous and triggered at discrete time intervals with a minimum delay of 1 minute. Such a limitation resulted in unacceptable acquisition times for the 1.4 M words of the AIS, and ended up making manual transcription more advantageous.

Finally, it was discovered that the automation mechanism only provided for the use of predefined standard languages, and prevented the use of a user-trained language such as that of the AIS.

6.2. Embedding an OCR in NavigAIS

Tesseract offered significant advantages:

  • It provided all the source code.
  • It could be easily integrated into the processing chain.
  • It had a CER comparable with that of the most prestigious commercial software.

A Tesseract version was compiled as a Matlab MEX file, and incorporated in the NavigAIS processing loop to obtain immediate results.

In this way the software moves in turn to each AIS locality selected by the supervisor, and waits for interactions. The main task of the supervisor is to choose the right candidate for recognition among the possible lemmas near the current point. As anticipated, this is not always a trivial task (Ch. 4). Following the supervisor's indication, the software proceeds automatically, selecting the zone containing the phonetic sequence (blue rectangle in Fig. 27, 29) and submitting the chosen image to the OCR.

The Tesseract engine was trained to recognize the AIS symbols on different sets of real characters taken from the maps, or on an artificial AIS-like font, which makes it easy to prepare the test sequences (Fig. 31).

The CER, however, was still unacceptable despite the training phase: as shown in Fig. 31 (right column), it was about 28%, still too high considering that the OCR output strings required long checking and correction times.

6.3. The solution: Decompose the Problem

The approach was then to decompose the problem into more manageable ones, according to the concept of decision tree. The AIS symbols were separated into the base glyph, easily recognizable by a normal OCR, and the diacritics, processed by dedicated software. The diacritics too are divided according to their geometric properties and relative position, and processed with a Matlab code (used for the NavigAIS interface) (Fig. 28-29).

In this way it is possible to use Tesseract to recognize the base glyph alone with an optimal CER, to use Matlab to process the diacritics, and then to apply a set of rules appropriate to the AIS symbolism (Fig. 30).
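
The recombination of a base glyph with its diacritics can be illustrated with Unicode combining characters. The sketch below is written in Python rather than Matlab, and the table `ALLOWED` is a hypothetical three-entry inventory invented for the example; it does not reproduce the actual AIS rules or phonetic inventory.

```python
import unicodedata

# Hypothetical mini-inventory: combining diacritics accepted by each base glyph.
ALLOWED = {
    "e": {"\u0323", "\u0301", "\u0303"},  # dot below, acute accent, tilde
    "o": {"\u0323", "\u0301"},            # dot below, acute accent
    "s": {"\u030c"},                      # caron
}


def compose(base, diacritics):
    """Attach the compatible diacritics to a base glyph.

    Returns the composed character and a list of error messages, which a
    supervisor interface could display in red.
    """
    allowed = ALLOWED.get(base)
    if allowed is None:
        return base, [f"unknown base glyph {base!r}"]
    errors = [
        f"{unicodedata.name(mark)} not allowed on {base!r}"
        for mark in diacritics
        if mark not in allowed
    ]
    kept = "".join(mark for mark in diacritics if mark in allowed)
    # NFC normalization gives a canonical form, whatever the detection order.
    return unicodedata.normalize("NFC", base + kept), errors


print(compose("s", ["\u030c"]))
print(compose("o", ["\u0303"]))
```

The second call returns the bare base glyph together with an error message, mimicking the conformity check against the inventory.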

The image undergoes three processing phases (Fig. 26, 33):

  • A pre-processing step is performed to discard the lines belonging to nearby lemmas and/or touching the blue borders, and to eliminate noise and anomalous spots (with the so-called Salt & Pepper filtering) (Fig. 28).
  • We adopt a segmentation-then-recognition approach, to get from Tesseract the character bounding boxes (blue draggable resizable rectangles in Fig. 33), which are useful in the training phase and in the next post-processing step, and also provide a preliminary classification of the characters.
  • We then execute the final post-processing with a set of Matlab rules, based on geometric and topological properties, and specially designed to process the diacritics (Fig. 29-30). This phase also checks the diacritics (i.e. their relative position and compatibility with the basic glyph) and, in general, the conformity of the resulting character with the phonetic inventory. Faults and errors are reported visually in red to the supervisor.
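
The text does not specify which filter is applied against Salt & Pepper noise; a median filter is a common choice for this kind of noise, and the following Python sketch is given under that assumption. The function name `median_filter` and the 5x5 test image are invented for illustration.

```python
from statistics import median


def median_filter(image):
    """Apply a 3x3 median filter; border pixels are left unchanged."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [
                image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            ]
            result[y][x] = median(window)
    return result


# A background (0) with one isolated "salt" pixel (1) in the middle.
noisy = [[0] * 5 for _ in range(5)]
noisy[2][2] = 1
cleaned = median_filter(noisy)
print(cleaned[2][2])
```

An isolated spot is outvoted by its eight neighbors and disappears, while compact strokes several pixels wide are largely preserved.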

Finally the supervisor checks the OCR output, can manually correct any error with the virtual keyboard, and lets the software store the sequence in the database (Fig. 26-27).

After saving, NavigAIS moves to the next point and repeats the procedure.

Fig. 26 – NavigAIS OCR - On the left the map n. 1325, “la botte” (barrel) with the navigation window. The OCR (red blocks) receives the graphic area containing the words to recognize (green rectangle). On the right, the virtual keyboard to edit the OCR string, then automatically saved in the (blue) database. On the background, the current Excel sheet of the database, where the OCR saves the lemmas (box 1).
www3.pd.istc.cnr.it/navigais-web/?map=1325&point=10

6.4. OCR Result Evaluation

For the evaluation of the recognizer's performance, various tests were carried out which concerned both the stages of the OCR training and the actual recognition step.

For the training phase, a minimal test was prepared with 50 words (approximately 500 characters in total) from the AMDV database.

The test was intentionally limited, in order to check quickly the effect of the changes made over time to the test set, and to compare the accuracy of the various trained languages, taken on their own or in combination.

Fig. 31 summarizes the OCR results with languages trained on different training sets, taken on their own or in combination with each other.
The three vertical columns report the CER for the three main test modes:

  • OCR Tesseract alone without character spacing (Fig. 31, right).
  • OCR Tesseract alone with character spacing (Fig. 31, center).
  • OCR with post-processing (Fig. 31, left).

Increasing the character spacing improves the CER by about two percentage points, to 26% (central column, Fig. 31), and, finally, the Matlab post-processing, with 350 dedicated rules, produces a more significant gain of about 25 percentage points, and a final CER of 1.41%.

The best result (CER = 1.41%) is obtained by combining various training sets (ais01, ais09, ais10, ais11, ais12) and executing the post-processing step, while the worst result, CER = 25.93%, is obtained by a set artificially built with a font similar to that of the AIS, but evidently quite far from its graphic reality.

Without the post-processing phase, but with the expansion of the characters, the CER rises and ranges from 22% to 40%; without either the post-processing phase or the expansion of the characters, it ranges from 24% to 40%, i.e. 2-3 percentage points worse than with character dilation.

Fig. 31 also reports the WER (Word Error Rate), the error on whole words, which is of little interest for the AIS.

The acquisition time per point was less than 14 s for sequences of about 8.36 characters, less than half the 30 s initially considered acceptable for the project to be preferable to the manual transcription of the AIS text. Obviously, the sequence length must be taken into account: as the number of characters increases, so does the acquisition time.

The error rate, instead, does not depend on the length of the sequence recognized by the OCR and remains unaltered. The CER can increase by a few tenths of a percentage point in the "anomalous" sequences, in which the lines of the characters touch each other and the characters are badly deteriorated. In these cases, the lack of a dictionary of dialectal terms prevents the OCR from correctly interpreting the damaged characters in a certain sequence.

An acquisition test on real data from 14 complete AIS maps, for a total of 100,000 characters, yielded an average CER of 3.65%, computed with the so-called Levenshtein Distance⁴, which counts every insertion, omission and replacement of characters and diacritics.
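
For reference, a CER based on the Levenshtein distance can be computed as in the Python sketch below. Decomposing the strings with Unicode NFD normalization, so that each diacritic counts as a separate symbol, is an assumption of this sketch and not necessarily the convention used in the project; the function names are hypothetical.

```python
import unicodedata


def levenshtein(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference symbol
                    current[j - 1] + 1,  # insertion of a spurious symbol
                    previous[j - 1] + (ref_symbol != hyp_symbol),  # substitution
                )
            )
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by the reference length."""
    # Assumption: NFD splits base glyphs and diacritics into separate symbols.
    reference = unicodedata.normalize("NFD", reference)
    hypothesis = unicodedata.normalize("NFD", hypothesis)
    if not reference:
        raise ValueError("the reference string must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(cer("abcd", "abed"))
```

With this convention a missing or wrong diacritic costs exactly one error, like a wrong base character.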

Fig. 27 – NavigAIS OCR Scheme
In the upper right corner (box 2), the map and locality search window.
At the bottom, the virtual keyboard (box 7) to edit the string returned by the OCR (box 5).
The blue names on the AIS map (NavigAIS-web K. 548.71) are those not yet processed by the OCR.
On the background, the current Excel sheet of the database, where the OCR saves the lemmas (box 1).

Fig. 28 – NavigAIS OCR operates by separating the characters from the diacritics. The post-processing phase uses a set of ad hoc rules in the absence of dictionary(-ies) and language model(s).

Fig. 29 – NavigAIS OCR post-processing. The 10 red colored errors (characters + diacritics), present in the Tesseract output, have been automatically corrected by the final post-processing phase.

Fig. 30 - NavigAIS OCR: Scheme of the rules applied in the post-processing phase.

Fig. 31 - Comparative results with different training sets. The three columns report the CER for the three main test modes: OCR Tesseract alone without character spacing (right) with a CER=28.1%, OCR Tesseract alone with character spacing (in the center) CER=26.5%, OCR with post-processing (left) CER=1.7%.

Fig. 32 - The CER results on 14 AIS maps for a total of 100,000 characters. The average CER is 3.65% for the 14 maps, and ranges from 1.35% to 5.87%. The acquisition time is on average lower than 20 s per AIS entry, and varies between 13.63 s and 25.31 s according to the number of characters recognized [3095-9057].

Fig. 33 - Detailed example of OCR results for a single lemma: with Tesseract alone the CER is 67%. With the character spacing (without post-processing on the OCR results), the CER is reduced to 40%. Finally, with spacing and post-processing, all characters are recognized correctly with CER = 0% (NavigAIS-web K. 1327.177).

Fig. 34 – SyncRec control score: lines 5 to 14 command an image sequence in an initial learning session, while lines 16 onward provide the actual audio recording session. Each item in the list controls an event (displaying a text, an image, or listening to a sound file), used to label each related recording of the informant.

7. Multimedia Atlas of Veneto Dialects

The Multimedia Atlas of Veneto Dialects (AMDV, acronym from Atlante Multimediale dei Dialetti Veneti) is an interdisciplinary project which brought together a team⁵ of dialectology, etymology, and voice processing researchers, with the purpose of creating a linguistic atlas of the dialects of the Veneto region (North East of Italy), exploiting the current geolinguistic methodologies.

The AMDV is mainly focused on the diachronic comparison of the AIS lexical data, with the records collected in the same localities at the present time.

The AMDV was inspired by similar digital atlases, such as ALEPO (Telmon/Canobbio 1985), ALD (Goebl 1998), VIVALDI (Kattenbusch 1998-2016b), etc., and, as mentioned in Ch. 4, particularly by a work done some years ago on the Trentino dialects by the Museo degli Usi e Costumi della Gente Trentina, San Michele all’Adige (Trento).

Fig. 35a – The AMDV main screen. On the background NavigAIS (online version) called by the pushbutton (top right with the red arrow). On the right bottom, a Scheuermeier photo taken on February 2, 1922, in Tarzo (Treviso), of a farmhouse between Tarzo and Vittorio Veneto. On the bottom left, a card with the etymologies of the Veneto words to indicate the barn.

The purpose of the AMDV was the creation of a talking atlas which could give back the phonic reality behind the more or less questionable transcriptions of the traditional maps (Goebl, H., 1994).

The speech sound is meant to be the “added value” of the project, for it allows:

  • To study in an effective way the relationships between the lexical, phonetic and phonological aspects of the complex dialectal reality.
  • To give the user all the current processing tools useful to characterize the dialects (extracting the appropriate parameters such as pitch, sonogram, formants, mapping the vocoids in a reference space, etc.) (Fig. 38).
  • To turn the unalterable transcription of the traditional atlases into a work in progress, subject to modifications and improvements by the users themselves.
  • To transform the historical pictures (Fig. 35a, 37, 45) and drawings (Fig. 2, 20, 42) into talking documents which integrate the ethnographic and linguistic aspects of material culture in a powerful way.

The relevant aspect of the AMDV project was the development of a methodology appropriate to the large amount of linguistic data required⁶, and also adequate to the available human and financial resources and to the duration of the project.

A great effort was made at the beginning to organize the data in a manageable way, and to introduce in all the planned steps the highest possible degree of automation regarding: the database building, the file labelling and saving, the post-processing of the audio-visual documents, the editing of pictures and drawings, etc. In fact, the automation of an operation as trivial as opening and saving a file, which could take about 10 s, can lead to an enormous saving of time if repeated tens of thousands of times.

As regards the automation process, we solved the tedious task of searching for a desired word or comment in a continuous audio session, which could last many hours, by segmenting the audio material directly in the field. A program, SyncRec, was written to present an image sequence (top right, Fig. 17), optionally randomized, and to launch the synchronous recording process. At the completion of the answer or comment for that image, the speech file was automatically labelled with the name used in the image list (Fig. 34), and saved in the appropriate directory. The software accepts an Excel or text file with the score items, and allows different sources (audio files, written texts, videos, and images) to be mixed, according to a random or sequential scheduling, so as to minimize the interference with the subjects.
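
The scheduling and labelling logic can be sketched as follows. This is not the SyncRec code: the function `plan_session`, the two-field score items and the `.wav` naming scheme are assumptions made only to illustrate how each recording can inherit its label from the score.

```python
import random
from pathlib import PurePosixPath


def plan_session(items, out_dir, randomize=False, seed=None):
    """Return (label, stimulus, recording_path) tuples in presentation order.

    `items` is a list of (label, stimulus) pairs, as could be read from a
    text or spreadsheet score; each recording is named after its label.
    """
    order = list(items)
    if randomize:
        # A seeded generator keeps a randomized session reproducible.
        random.Random(seed).shuffle(order)
    folder = PurePosixPath(out_dir)
    return [
        (label, stimulus, str(folder / f"{label}.wav"))
        for label, stimulus in order
    ]


score = [("botte", "botte.jpg"), ("cesta", "cesta.jpg"), ("ditale", "ditale.jpg")]
for label, stimulus, path in plan_session(score, "recordings/p336"):
    print(label, stimulus, path)
```

Because the file name is derived from the score item, no later search through a continuous recording is needed to find a given word.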

The SyncRec software would be of little interest if it were designed only as a recording tool. The possibility to schedule the desired audio-visual events during the interview, and the automation of the questionnaire submission and of the record saving, gave it the flexibility needed for use in different contexts and applications. The use of this tool allowed the collection of 14521 audio files and 2300 speech comments of very high quality, without pops, clicks or background noise, encoded in the standard IEEE 32-bit float format at a 96 kHz sampling rate, and normalized at -1 dB to avoid clipping and distortion effects, so as to provide the best conditions for the acoustical data extraction. For the same reason, the AMDV words were also recorded in isolation, and not extracted from the informant's comment, to avoid the coarticulation effects and the cropping problems due to continuous speech.
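
As a worked example of the normalization step, the sketch below scales a signal so that its peak lies 1 dB below full scale. It assumes that "-1 dB" refers to the peak level relative to full scale (dBFS) of floating-point samples in the range [-1, 1]; the function name `normalize_peak` is hypothetical.

```python
def normalize_peak(samples, target_db=-1.0):
    """Scale float samples so that the absolute peak equals target_db dBFS."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    # Amplitude ratio corresponding to a level in dB: 10 ** (dB / 20).
    gain = 10 ** (target_db / 20) / peak
    return [s * gain for s in samples]


signal = [0.1, -0.5, 0.25]
normalized = normalize_peak(signal)
print(max(abs(s) for s in normalized))
```

A level of -1 dB corresponds to a peak amplitude of about 0.89, which leaves a small margin below full scale and thus avoids clipping.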

SyncRec, even before an official release, has been used to acquire the audio material in a joint project of the University of Western Sydney and the Italian Research Council (CNR), Italian roots in Australian soil: Tracing regional linguistic heritage in first and second generation bilinguals, which aims to identify the cultural and linguistic elements of Italian regional origin in the first, second and third generations of Australian-Italian families. The project also exploited the AMDV database to extract the phonetic data references for AIS 1921 words and AMDV 2009-2010 words, and used the AMDV comments to introduce the SyncRec recording sessions.

Fig. 35b – The AMDV main screen. The orographic map of previous figure has been replaced by the one showing the borders of the dioceses.

Fig. 36 – The AMDV main screen: On the left the toolbar (to increase the font size, to access the annotated object list, the phonetic search window, the sonogram, the informant biography, the page print, the map zoom, the configuration options, the switch orographic/diocese map, the interface language, etc.).

The implementation of these tools made it possible to carry out efficiently the data collection in the 26 localities, which lasted from July 2009 to the end of January 2010, and the first phase of audio post-processing, from February 2010 to July 2010. We undertook a second recording round, which kept us busy from July to the end of October 2010, to correct some errors, replace the noisy or poor-quality lemmas, resolve doubtful cases, and insert other interesting lemmas, increasing the lexical corpus by about 20%.

The completion of the audio processing was followed by the phonetic transcription of the acquired material by Giacomo Ferrieri (Luciano Canepari collaborator), and finally by the database implementation, which lasted about one year.

As stated in the introduction, the AMDV purpose was to provide an instrument useful in the study of linguistic modifications in the Veneto region between 1921 and 2009-2010. For this reason, in addition to the standard IPA transcription, we had to work out a parallel transcription according to the AIS convention. In this way, as we can see in Fig. 35a, 35b, it is possible to effectively compare the diachronic results using the same code: the yellow background indicates the 1921 AIS word, the orange one the AMDV 2009-2010 word transcribed with the AIS-like symbolism, and the magenta one the AMDV 2009-2010 word in the IPA convention. The software lets the user choose which of the three transcriptions are displayed, modify the font size, zoom in and out, and drag the Veneto map.

The AMDV software core was implemented from January to November 2011, and was subjected to a thorough revision and refinement that lasted until September 2012.

In Fig. 35a, 35b, 36 we can see the AMDV main screen. In the top left corner, the buttons allow the user to modify the size of the lemma font, to access the annotated object list (Fig. 37) and the phonetic tree and search engine (Fig. 39-40), to display the sonogram and vowel space (Fig. 38), to display the informant biography, to print the documents and images, to go to the next or previous map, to zoom the Veneto map in and out, to configure the AMDV parameters, to switch between the orographic (Fig. 35a) and diocese maps (Fig. 35b), to hide all the list boxes, and to choose the interface language. Below the buttons are the string search box and the list boxes with the lemma main index, the ethnographic audio comments, the AIS legends, and the etymological files (Fig. 42). On the right side are the list boxes to choose drawings, video interviews, pictures, and AMDV relevant documents, and to follow some discovery paths through the AMDV audio-visual documents.

The AMDV software offers a computationally efficient phonetic search engine (Fig. 39-40), which is a salient feature for the exploration of the AMDV database. To simplify the related code, we avoided the use of a proprietary font to represent the phonetic symbols, which would have had two undesirable side effects: the need for a font installation step, and a much more complex search algorithm. The optimal solution was to represent the phonetic symbols with a standard Unicode font, installed by default on all current PCs, and provided with a set of diacritics large enough to cover the whole corpus of phones. In this way, we can easily search for a string with or without diacritics, and without knowing the specific character code. In the root search, the entire database can be examined in a few seconds, including or omitting hundreds of phonetic variants. In the same way, you can search for the linguistic features associated with a diacritic symbol (nasality, degree of vowel closure and opening, lengthening, accent, etc., corresponding to the bottom green keys in Fig. 39-40), independently of the specific phones.
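
One way to obtain a search with or without diacritics on Unicode text is canonical decomposition, as in the Python sketch below. It is not the AMDV implementation: the functions `strip_diacritics`, `search` and `with_mark` and the sample transcriptions are invented for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Remove every combining mark after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(query, lemmas, ignore_diacritics=True):
    """Return the lemmas containing the query, with or without diacritics."""
    if ignore_diacritics:
        key = strip_diacritics
    else:
        def key(text):
            return unicodedata.normalize("NFD", text)
    needle = key(query)
    return [lemma for lemma in lemmas if needle in key(lemma)]


def with_mark(mark, lemmas):
    """Return the lemmas carrying a given combining mark on any phone."""
    return [l for l in lemmas if mark in unicodedata.normalize("NFD", l)]


# Invented sample transcriptions: "be" + dot below + acute + "l", etc.
lemmas = ["be\u0323\u0301l", "bal", "pa\u0303n"]
print(search("bel", lemmas))
print(with_mark("\u0303", lemmas))
```

The first call finds the lemma whatever its diacritics; the second retrieves all the forms marked with a tilde, independently of the phone that carries it.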

The AMDV also includes a database of all 5870 annotated objects, which can be ordered by item name, file name, author, AIS locality number, AIS locality name, or date (Fig. 37).

The software provides a large number (70) of optional parameters, which allow: switching the interface default language (Italian or English); setting the directory and database paths; deciding which lemma transcriptions must appear; changing the visualization effects (zooming, fading, object contours, slide timing, full screen of images, etc.); setting the analysis parameters, etc., and, of course, resetting all the changed parameters to a default, and saving them.

Moreover, the user can manually configure all the elements of the AMDV interface (font size, color and type, element color and dimension, etc.) with an initialization file.

Fig. 37 – Searchable list of all 5870 annotated AMDV objects. The cells contain the direct links to the related object in the audio-visual documents.

Fig. 38 – Sonogram, pitch and intensity contours in the word “ditale” (thimble) [el̝ ˈd̪jaˑal̝] by a female subject (mean pitch 213 Hz), from Raldon (VR). In the left window we can see a 150 ms /ja/ transition in 10 ms steps, mapped in the Italian vowel space: the beginning of each arrow indicates the formant F1, F2 positions, the arrow length measures the instantaneous articulatory speed.

Fig. 39 – The AMDV database search for an interdental /θ/ followed by /i/.
The 41 cases matching the sequence are displayed on the left, highlighting the searched string. The results are delivered on demand in a HTML document (left). The user can search for non-contiguous sequences, and limit the look up to a single or multiple location; can require the exact matching, or accept all the diacritic variants.

Fig. 40 – Search of all VCV sequences with a bilabial occlusive (orange), followed by a high vowel (blue), and a nasal consonant (green), for a total of 420 search loops. The tree on the left allows the selection of a phonetic group. The 20 results (displayed pressing the "Results" button) are delivered on demand in an HTML file (left).

7.1. Heritage of Etymological Studies

One of the most significant tasks in the AMDV project was the preparation of the lexical, phonetic and etymological files related to the dialectal items, since it differentiates the AMDV from similar lexical atlases. The study aimed to identify lemma convergences and divergences with respect to Italian and their areal distribution, and to highlight their historical significance (Fig. 42).

The research on the lexical types took into account the results of both the AIS and the AMDV investigations, displayed side by side in the AMDV, and also exploited the possibility of listening to the original sound source, which allows a better analysis of the relationship between the lexical, phonetic and phonological aspects. As noted earlier, direct comparison with the audio has the further advantage of making any transcription less definitive, since it can be modified and improved. It also helps to characterize the auditory perception of the transcriber, which, even for an excellent collector such as Scheuermeier, could lead to systematic errors.

Other important sources of information for the lexical comments were the Scheuermeier and Pellis pictures and the Boesch drawings, which were used to disambiguate the identification of objects and, at the same time, to understand their function and use.

The approach used in the research on lexical types conforms to an etymological-historical framework derived from the studies of A. Zamboni (Zamboni 1974; Zamboni 1984) and G. Pellegrini, which aimed to resolve in an exemplary way the near and remote etymologies of the Veneto and Ladin lexicon.

Other landmarks were the studies of Kramer, in particular the Etymologisches Wörterbuch des Dolomitenladinischen (EWD) (Kramer 1991), and the Italian Etymological Lexicon (LEI) of M. Pfister and M. Schweickard (Pfister/Schweickard 1979- b).

Fig. 41 - Search results for the word “cesta” (basket) in all the AMDV documents. Selecting a cell opens the related image, picture, or document. At the bottom, the related HTML page, ready for web access.

7.2. Ethnographic Research

The AMDV places particular emphasis on ethnographic research, because it complements the linguistic aspect in a functional and inseparable way.

In fact, the AIS turns out to be an inexhaustible mine of ethnographic information, which includes the legends of the AIS maps, the Paul Boesch drawings (Fig. 2, 20, 42), the Paul Scheuermeier pictures (Fig. 35a, 37, 45), and the volume on material culture published by Scheuermeier in 1943 (Scheuermeier 1943).

A philological approach was adopted to collect all the sources of information (paper documents, audio-visual testimonies, etc.), and original processing methodologies were applied to automate all the possible control loops to digitize, convert, transcribe, translate, and index the documents.

This approach concerns the Interoperability required by the FAIR principles. We followed a criterion consistent with the etymological meaning of the word: in our opinion, the basic property, operability, must be achieved before the consequent one, inter-operability. In other words, a system that aims to interact with the multidimensional World Wide Web must be natively "operative", i.e. it must first ensure the FAIRness functionalities in the local environment.

This is the reason why NavigAIS and AMDV followed two implementation phases: first, as a batch version, and then, as an online application.

The operability criterion applied in our projects was therefore to automate every possible processing chain, so that it required no, or minimal, human supervision. In particular:

  • Complete automation was achieved for all the NavigAIS implementation phases, so that they can be executed (and inspected) one at a time, or carried out all together, without any human intervention (Ch. 5.1).
  • The automation of on-field audio acquisition (Ch. 5.2). A dedicated program, SyncRec, followed a user-programmed sequence, displaying the image to be commented on the informant's display and, at the same time, showing the related NavigAIS map on the screen of the linguist, who was thus able to interact effectively with the dialect speaker (Fig. 17). For each entry in the list, the audio material was automatically labelled and saved with the image name and a sequential number (to cover any repetitions), avoiding the loss of time and valuable data in the recording and segmentation phase.
  • The automated creation of the AIS and AMDV databases.
  • The automatic indexing of all the texts.
  • A supervised procedure to trace the contours of the objects of interest in the photos and drawings (Fig. 45).
  • The automatic extraction of the phonetic inventory from the real data. This makes it possible to automate the validation of the character sequences output by the OCR, marking the symbols not listed in the inventory (Fig. 46), and also to automate the error check of the database content (Fig. 47).
  • The automatic insertion of links between the lemmas saved in the database and their physical position on the NavigAIS maps. The links make it possible to quickly inspect and correct the errors reported in the inventory and error lists of Fig. 46-47.

Fig. 42 – Example of a (partial) lexical-etymological document. It deals with the roots of the dialectal terms for “l’aratro” (the plow). The drawings are the work of Paul Boesch (1931).

As seen in previous chapters, the AIS is the primary source for all three projects: the creation of a digital and navigable AIS atlas allowed the realization of the AMDV and AIS Reloaded projects, which, in turn, will make the AIS database available.

All the AIS secondary sources have been acquired, including the text of the AIS legends, the AIS images (photos by Paul Scheuermeier and drawings by Paul Boesch) and the related annotation cards (Fig. 45), the Scheuermeier diary, the letters and postcards written by the AIS team, etc.

Other data have been collected from textual, graphic and photographic documents and videos of various origins. After the death of Alberto Zamboni (25 January 2010, a few months after the start of the AMDV project), his fundamental essay on the Veneto dialects (Zamboni 1974) was digitized, integrated into the AMDV atlas, and enriched with the 2009-2010 AMDV recordings (Fig. 50).

At a low level, all the documents are formatted as standard HTML documents, using Unicode (the Universal Coded Character Set) encoded as multi-byte UTF-8 (8-bit Unicode Transformation Format), so that they are independent of the system that hosts them and can be searched and delivered online without the need for conversion.

The same Unicode encoding was also applied to the linguistic information, which means that each symbol and diacritic used has a unique universal numeric identifier. ASCII coding was discarded, since it forces a non-Latin symbol to be coded as a combination of characters.

This approach greatly simplifies the software coding which:

  • Natively handles and returns Unicode strings.
  • Avoids conversion and interpretation of information.
  • Simplifies implementation of the indexing and retrieval engines.
  • Assures the FAIR properties of the exchanged documents.
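
As an illustration of how Unicode strings simplify retrieval, the following minimal sketch (not the AMDV implementation) splits a transcription into symbols, each made of a base character plus its combining diacritics, and searches for a sequence either exactly or accepting all diacritic variants. The function names `clusters` and `contains_sequence` are hypothetical, and the use of NFD normalization is an assumption made for this example.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Split text into symbols: one base character plus its combining marks."""
    result: list[str] = []
    # NFD decomposes precomposed letters (e.g. U+00E9) into base + diacritic.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and result:
            result[-1] += ch          # attach the diacritic to its base
        else:
            result.append(ch)         # start a new symbol
    return result


def contains_sequence(query: str, entry: str, exact: bool = True) -> bool:
    """Return True if the symbols of `query` occur contiguously in `entry`.

    With exact=False the diacritics are ignored on both sides, so every
    diacritic variant of a base symbol is accepted.
    """
    q, e = clusters(query), clusters(entry)
    if not exact:
        q = [symbol[0] for symbol in q]   # keep only the base character
        e = [symbol[0] for symbol in e]
    n = len(q)
    return any(e[i:i + n] == q for i in range(len(e) - n + 1))
```

Comparing whole symbols rather than raw code points keeps an exact search for a plain [ e ] from matching an accented one, while the relaxed mode reduces every symbol to its base character. Note that only characters classified by Unicode as combining marks are attached to their base; spacing modifier letters would remain separate symbols in this sketch.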

The AMDV reproduces all the AIS drawings, and all the pictures and documents related to Veneto, as a traditional printed volume could do, but it also exploits the specific strengths of multimedia technology, which can literally transform them into talking and navigable documents. We created a database with all the picture and drawing information: the author, the locality where the photo was taken, the Scheuermeier annotations, and the graphical contour of all 5870 relevant objects represented (Fig. 45). This allows the user to explore the documents interactively with the mouse, in order to “discover” and listen to the related dialectal lemmas, or to listen to the comments made during the interviews.

The AMDV was also enriched with the postcards regularly sent by Scheuermeier to Jaberg and Jud during his fieldwork in the Veneto region, which are a precious testimony of his linguistic and ethnographic adventures (and sometimes misadventures!).

The AIS atlas was not the only ethnographic source for the AMDV: we also included the pictures taken by Ugo Pellis for the Atlante Linguistico Italiano during the same period, and pictures and drawings found in public and private collections, including some very interesting documents given by the informants (one of them is a testimony of the last days of Mussolini by the driver of the car carrying the famous “Dongo” treasure, which the fascist leader tried to take with him to Switzerland).

It is necessary to mention the contribution of Serafina Prest (one of the AMDV informants), who created an extraordinary series of drawings illustrating all the aspects of the life in her native village, Losego (Fig. 43), with the children's games, the work in the fields, and the reproduction one by one of the Losego village courtyards (Fig. 44).

The AMDV allows information retrieval across all the audio-visual and textual sources, and displays the results in an indexed table (Fig. 36), which can be sorted by column and gives direct access to the different document sources.

7.3. Dialect Phonetic Characterization

As regards the features useful to characterize the dialectal lemmas from an acoustic and phonetic point of view, the AMDV provides the traditional representations (sonogram, pitch and intensity contours, formant extraction, and phonetic segmentation, Fig. 38), and also an F1-F2 formant plot that maps the dialectal sounds onto the Italian vowel space (top left window, Fig. 38), based on Franco Ferrero's fundamental research on Italian vowels at ISTC (Ferrero et al. 1979). The user can choose the metric of the vowel space: linear Hz, Bark (corresponding to the 24 critical bands of hearing proposed by E. Zwicker in 1961), or ERB (the Equivalent Rectangular Bandwidth scale by B. Moore and B. Glasberg, 1996).
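
To make the choice of metric concrete, the sketch below converts a formant frequency from Hz to the Bark and ERB-rate scales. It uses two widely cited analytic approximations (Zwicker and Terhardt 1980 for Bark, Glasberg and Moore 1990 for the ERB-rate scale); whether the AMDV uses exactly these formulas is an assumption, and the function names are hypothetical.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Bark value of a frequency (Zwicker & Terhardt 1980 approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def hz_to_erb_rate(f_hz: float) -> float:
    """ERB-rate value of a frequency (Glasberg & Moore 1990 approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def convert_formants(f1_hz: float, f2_hz: float, scale: str = "hz") -> tuple[float, float]:
    """Map an (F1, F2) pair onto the chosen vowel-space metric."""
    converters = {
        "hz": lambda f: f,        # linear scale: no transformation
        "bark": hz_to_bark,
        "erb": hz_to_erb_rate,
    }
    if scale not in converters:
        raise ValueError(f"unknown scale: {scale!r}")
    convert = converters[scale]
    return convert(f1_hz), convert(f2_hz)
```

Both perceptual scales compress the high frequencies, so the F2 axis of the vowel plot is squeezed relative to the F1 axis when compared with the linear Hz representation.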

The segmentation and phonetic labelling of all the sound files, to be visualized in the sonogram window, is also planned for the near future. Some experimental tests have been made with Sonic, a sophisticated speech recognizer by Brian Pellom developed at the University of Colorado (Pellom 2001), with very promising results. In this case, Sonic is obviously not used to recognize dialectal speech, but to find the phonetic segments and word boundaries through automatic processing that does not require human intervention.

Fig. 43 - Serafina Prest - “il bucato” (the laundry), when there were no washing machines.

Fig. 44 - Serafina Prest - Courtyards of the Losego village.

7.4. Development Outlook

The most obvious continuation of the AMDV project will be the creation of an online AMDV version, which would allow web interaction in line with current scientific and technological developments and in agreement with the FAIR principles. The audio-visual resources have been prepared to ensure appropriate access and delivery on the web, while maintaining the optimal quality of the AMDV materials and functionalities.

Of course, the metadata structures and the web query protocols remain to be improved and completed according to recent theoretical and technological progress and the W3C (World Wide Web Consortium) recommendations. Web Services, i.e. software designed to support interoperability between different computers, are quickly evolving toward simpler REST-compliant (Representational State Transfer) systems, which use a uniform syntax, self-descriptive metadata, and explicit typing for message structure, fields, etc., and which can expose an arbitrary set of operations.

As a minimal example of the REST approach, the syntax of the input/output arguments of a query string that returns results such as those in Fig. 39-41 could be expressed with self-explanatory, unambiguous metadata:

{
  "Search_string": {
    "inputs": [
      {"name": "string_to_search",
       "type": "char",
       "purpose": "Search the string in the database"}
    ],
    "outputs": [
      {"name": "sorted_output",
       "type": "string array",
       "purpose": "Return the sorted list of database entries containing the searched string"}
    ]
  }
}
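
The following sketch shows how a service honouring the metadata above might behave. It is only an illustration under stated assumptions: the function names, the in-memory list of entries, and the JSON response shape are hypothetical, and only the argument names `string_to_search` and `sorted_output` come from the descriptor.

```python
import json
from urllib.parse import parse_qs


def search_string(string_to_search: str, entries: list[str]) -> list[str]:
    """Return the sorted list of entries containing the searched string."""
    return sorted(entry for entry in entries if string_to_search in entry)


def handle_query(query: str, entries: list[str]) -> str:
    """Answer a URL query string such as 'string_to_search=cest' with JSON."""
    params = parse_qs(query)
    values = params.get("string_to_search")
    if not values:
        # The descriptor declares a single mandatory input argument.
        return json.dumps({"error": "missing parameter: string_to_search"})
    result = search_string(values[0], entries)
    # ensure_ascii=False keeps phonetic symbols readable in the response.
    return json.dumps({"sorted_output": result}, ensure_ascii=False)
```

For example, `handle_query("string_to_search=cest", ["cesto", "aratro", "cesta"])` returns a JSON object whose `sorted_output` field lists the two matching entries in sorted order.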

Fig. 45 - Scheuermeier photo n. 388, taken on April 3, 1921, in Cerea (Verona). The objects of linguistic interest are annotated and marked with a polygonal contour to allow the exploration of the picture. In the top right corner, the card by Robert, son of Paul Scheuermeier; in the left corner, the corresponding web-ready AMDV HTML document.

Fig. 46 - AIS phonetic inventory extraction from the real data.

The following lines are the partial output of an automatic procedure for extracting all the phonetic symbols of the items acquired by the NavigAIS OCR (Ch. 6), and saved in the AISr database. The second column shows the AIS symbol, followed by the occurrence number, the hexadecimal and decimal code of the characters and diacritics, and a link to graphic example(s) on the AIS maps.

......................
79 - [ ć ] n. occ.: 606 - code: [Hex: 0063 0301 - Dec.: 99 769 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1187&point=715
80 - [ ć̣ ] n. occ.: 9 - code: [Hex: 0063 0301 0323 - Dec.: 99 769 803 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1187&point=322
81 - [ ć̨ ] n. occ.: 1 - code: [Hex: 0063 0301 0328 - Dec.: 99 769 808 ] link: www3.pd.istc.cnr.it/navigais-web/?map=39&point=333
82 - [ ć̩ ] n. occ.: 3 - code: [Hex: 0063 0301 0329 - Dec.: 99 769 809 ] link: www3.pd.istc.cnr.it/navigais-web/?map=32&point=625
84 - [ c̋ ] n. occ.: 41 - code: [Hex: 0063 030B - Dec.: 99 779 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=319
85 - [ c̩̋ ] n. occ.: 1 - code: [Hex: 0063 030B 0329 - Dec.: 99 779 809 ] link: www3.pd.istc.cnr.it/navigais-web/?map=357&point=29
86 - [ c̩ ] n. occ.: 1 - code: [Hex: 0063 0329 - Dec.: 99 809 ] link: www3.pd.istc.cnr.it/navigais-web/?map=39&point=330
87 - [ d ] n. occ.: 1685 - code: [Hex: 0064 - Dec.: 100 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=818
88 - [ ḍ ] n. occ.: 86 - code: [Hex: 0064 0323 - Dec.: 100 803 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=875
90 - [ e ] n. occ.: 387 - code: [Hex: 0065 - Dec.: 101 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=943
91 - [ è ] n. occ.: 20 - code: [Hex: 0065 0300 - Dec.: 101 768 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1187&point=553
92 - [ ẹ̀ ] n. occ.: 31 - code: [Hex: 0065 0300 0323 - Dec.: 101 768 803 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=581
93 - [ ę̀ ] n. occ.: 35 - code: [Hex: 0065 0300 0328 - Dec.: 101 768 808 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=545
94 - [ è̩ ] n. occ.: 1 - code: [Hex: 0065 0300 0329 - Dec.: 101 768 809 ] link: www3.pd.istc.cnr.it/navigais-web/?map=32&point=354
95 - [ è͈ ] n. occ.: 1 - code: [Hex: 0065 0300 0348 - Dec.: 101 768 840 ] link: www3.pd.istc.cnr.it/navigais-web/?map=712&point=271
96 - [ é ] n. occ.: 67 - code: [Hex: 0065 0301 - Dec.: 101 769 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=792
97 - [ ẹ́ ] n. occ.: 300 - code: [Hex: 0065 0301 0323 - Dec.: 101 769 803 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=765
98 - [ ẹ̫́ ] n. occ.: 1 - code: [Hex: 0065 0301 0323 032B - Dec.: 101 769 803 811 ] link: www3.pd.istc.cnr.it/navigais-web/?map=48&point=329
99 - [ é̤ ] n. occ.: 9 - code: [Hex: 0065 0301 0324 - Dec.: 101 769 804 ] link: www3.pd.istc.cnr.it/navigais-web/?map=430&point=14
100 - [ ę́ ] n. occ.: 362 - code: [Hex: 0065 0301 0328 - Dec.: 101 769 808 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=865
101 - [ é̩ ] n. occ.: 2 - code: [Hex: 0065 0301 0329 - Dec.: 101 769 809 ] link: www3.pd.istc.cnr.it/navigais-web/?map=712&point=44
102 - [ é͈ ] n. occ.: 12 - code: [Hex: 0065 0301 0348 - Dec.: 101 769 840 ] link: www3.pd.istc.cnr.it/navigais-web/?map=1647&point=708
.....................
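
A procedure of this kind can be sketched as follows. This is not the NavigAIS code: it assumes that each transcription is stored as base characters followed by their combining diacritics (as the hexadecimal codes in the listing suggest), and the function names and the `(map, point, transcription)` input format are hypothetical. Only the output layout and the link pattern are taken from the listing above.

```python
import unicodedata
from collections import Counter


def split_symbols(transcription: str) -> list[str]:
    """Group each base character with the combining diacritics that follow it."""
    symbols: list[str] = []
    can_attach = False                 # no base character seen yet
    for ch in transcription:
        if ch.isspace():
            can_attach = False         # a diacritic after a space stands alone
        elif unicodedata.combining(ch) and can_attach:
            symbols[-1] += ch
        else:
            symbols.append(ch)
            can_attach = True
    return symbols


def build_inventory(entries):
    """Count every symbol and remember the first (map, point) where it occurs."""
    counts: Counter = Counter()
    first_seen: dict = {}
    for map_no, point_no, text in entries:
        for symbol in split_symbols(text):
            counts[symbol] += 1
            first_seen.setdefault(symbol, (map_no, point_no))
    return counts, first_seen


def format_line(index: int, symbol: str, count: int, location) -> str:
    """Render one inventory line in the layout of the listing."""
    hex_codes = " ".join(f"{ord(c):04X}" for c in symbol)
    dec_codes = " ".join(str(ord(c)) for c in symbol)
    map_no, point_no = location
    return (f"{index} - [ {symbol} ] n. occ.: {count} - code: "
            f"[Hex: {hex_codes} - Dec.: {dec_codes} ] link: "
            f"www3.pd.istc.cnr.it/navigais-web/?map={map_no}&point={point_no}")
```

The text is deliberately not normalized here, so the diacritics are reported in the order in which they are stored, as in the listing.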

Fig. 47 - Output of the Automatic Procedure to Check the Errors in the AIS Database

The figure shows the partial output of an automatic check of the content of the AIS Reloaded database (created by the OCR described in Ch. 6). The related link leads to the AIS map and location, where the phonetic sequence can be inspected and, when necessary, modified.

In certain cases, the software can apply the correction automatically, as in the last line (map 192, point 947):

192.947 - [ aϑϑoppiçā́re ] Illegal symbol(s): Substitute [ ç ] with [ ʕ ] www3.pd.istc.cnr.it/navigais-web/?map=192&point=947

.....................
Map: K0014 - Char n.= 12137 - Entries n.= 395
...
14.3 - [ sǭrαs̩ ] Check accent: www3.pd.istc.cnr.it/navigais-web/?map=14&point=3
14.11 - [ tųαyα ] Check accent: www3.pd.istc.cnr.it/navigais-web/?map=14&point=11
14.58 - [ suręllį ] Check accent: www3.pd.istc.cnr.it/navigais-web/?map=14&point=58
14.314 - [ tuǜα sọ́; tuǖ̀s sọruǖ́s̩ ] Illegal symbol(s): Substitute [ ü ] with [u] + diacritic: www3.pd.istc.cnr.it/navigais-web/?map=14&point=314
...

Map: K0100 - Char n.= 7001 - Entries n.= 370
...
100.243 - [ ind ̨el pų̄́ls ] Check diacritic sequence: www3.pd.istc.cnr.it/navigais-web/?map=100&point=243
100.311 - [ bọwṣį ] Check accent: www3.pd.istc.cnr.it/navigais-web/?map=100&point=311
...

Map: K0105 - Char n.= 8767 - Entries n.= 396
...
105.142 - [ lāvre ] Check accent: www3.pd.istc.cnr.it/navigais-web/?map=105&point=142
105.190 - [ lä́rfi; ̥ɳ lǟ́rfu ] Check diacritic sequence: www3.pd.istc.cnr.it/navigais-web/?map=105&point=190
...

Map: K0191 - Char n.= 6622 - Entries n.= 369
...
191.58 - [ tsǫp ] Check accent: www3.pd.istc.cnr.it/navigais-web/?map=191&point=58
191.327 - [ s̑wẹ́t; -ẹ́s̑ ̣m. p. ] Check diacritic sequence: www3.pd.istc.cnr.it/navigais-web/?map=191&point=327
191.716 - [ tso pp ] Illegal special char(s) [26, 26,]: www3.pd.istc.cnr.it/navigais-web/?map=191&point=716
...

Map: K0192 - Char n.= 7993 - Entries n.= 360
...
192.73 - [ andā ] Check accent www3.pd.istc.cnr.it/navigais-web/?map=192&point=73
192.947 - [ aϑϑoppiçā́re ] Illegal symbol(s): Substitute [ ç ] with [ ʕ ] www3.pd.istc.cnr.it/navigais-web/?map=192&point=947
.....................
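
The "Illegal symbol(s)" part of such a check can be sketched as below. This is a hedged illustration, not the AIS Reloaded code: the function names, the `inventory` set and the `substitutions` table are hypothetical, and the rules behind the "Check accent" and "Check diacritic sequence" messages are not described in the text, so they are not reproduced here.

```python
import unicodedata


def split_symbols(text: str) -> list[str]:
    """Group each base character with the combining diacritics that follow it."""
    symbols: list[str] = []
    can_attach = False
    for ch in text:
        if ch.isspace():
            can_attach = False
        elif unicodedata.combining(ch) and can_attach:
            symbols[-1] += ch
        else:
            symbols.append(ch)
            can_attach = True
    return symbols


def check_entry(map_no, point_no, text, inventory, substitutions):
    """Report every symbol of an entry that is not in the phonetic inventory."""
    link = f"www3.pd.istc.cnr.it/navigais-web/?map={map_no}&point={point_no}"
    reports, seen = [], set()
    for symbol in split_symbols(text):
        if symbol in inventory or symbol in seen:
            continue
        seen.add(symbol)
        if symbol in substitutions:
            hint = f"Substitute [ {symbol} ] with [ {substitutions[symbol]} ]"
        else:
            hint = f"[ {symbol} ]"
        reports.append(f"{map_no}.{point_no} - [ {text} ] Illegal symbol(s): {hint} {link}")
    return reports


def autocorrect(text: str, substitutions: dict) -> str:
    """Apply the known substitutions; other problems are left to the linguist."""
    for wrong, right in substitutions.items():
        text = text.replace(wrong, right)
    return text
```

Entries that cannot be corrected automatically keep their link to the map, so that the transcription can be inspected against the original before it is changed.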

8. Difficulties Encountered, Solutions, and Results

The methodology developed to realize the AMDV goals can be briefly summarized in the following table that shows for each topic the related problems, the adopted criteria and the given solution(s):

AMDV - Difficulties and solutions

As regards the difficulties, a significant delay in the planned project development was due to the sudden loss in January 2010 of Alberto Zamboni. This unexpected event caused the lack of an irreplaceable linguistic-etymological expertise, and also a series of institutional-administrative consequences for the partners of the project (Padua University and ISTC), that lasted for several months.

A second source of trouble, on the ethnographic side, was the very limited time that the expert in charge could devote to the project. This was solved with the help of Carla Gentili, a specialist in Scheuermeier and AIS studies, and the generous cooperation of institutes and people such as the researchers of the Atlante Linguistico Italiano, the Museo Etnografico della Provincia di Belluno, and the Bern AIS archives.

As regards the problem of finding good informants, the selection process followed different channels (linguists, students, amateurs, friends, priests, mayors, etc.) and on some occasions held surprises. For example, an 84-year-old subject, very interesting from the dialectal point of view, was unable to sustain the effort of the interview for more than one hour; on another occasion, we found at the very beginning of the interview that the person could not speak fluently after a stroke. These misadventures are a consequence of the aging of the Italian population in recent decades, as Fig. 49 shows: the informant lists for the Veneto region give an average speaker age of 55 years in the 1921 investigation and 73 years (18 years older) in 2009-2010.

Experience proved that the probability of discovering a candidate who fitted the prerequisites was inversely proportional to the number of intermediaries who had led to the speaker, and that the best way was to go on site and talk personally with different possible speakers.

Fig. 49 - AIS and AMDV informant overview.

The NavigAIS, AMDV and AISr projects followed three sources of inspiration:

  • A more traditional approach based on the geolinguistic framework, to study the linguistic transformations that occurred in the Veneto and Southern Switzerland dialects between 1921 and 2009-2010.
  • A more innovative and experimental approach to develop automatic or supervised procedures to collect large amounts of linguistic data, to implement the databases and the related search mechanisms, to segment the audio material, and to extract the relevant speech parameters and phonetic mapping data.
  • An attempt to build these resources in accordance with the FAIR principles.

The results obtained go beyond the initial expectations and prove that the synergistic combination of these approaches can be very effective in obtaining results in a quite limited time; time can be a decisive factor in the case of dialects and languages in danger of extinction.

First of all, the AMDV introduced a relevant methodological innovation: on-field, real-time control of the informant responses. Moreover, the AMDV approach was not limited to solving a contingent problem, but was planned to serve the general interest of the linguistic community. As noted earlier, while the innovative character of this linguistic approach may be open to question, the projects have the merit of having created working instruments that arose from contingent necessities but were designed to be useful independently of their original context.

Following these inspirations, we decided to realize the digital navigable version of the entire AIS atlas, and the SyncRec software.

From the linguistic point of view, the main characteristic of the AMDV is the fusion of a set of powerful features (diachronic comparison of the AIS and 2009-2010 lexical corpora, search engine, phonetic tools, etc.) and the integration of the lexical, phonetic, etymological and ethnographic documents in a homogeneous environment that can express the complexity and richness of the material culture.

Finally, the most significant result obtained so far is the open access to the AIS corpus provided by the AIS Reloaded project. The task of indexing the 1.4 million words of the AIS atlas, impossible in Jaberg and Jud's time, has finally been accomplished.

To conclude, the NavigAIS, AMDV and AIS Reloaded projects are a small but valuable contribution to rediscovering, and saving from oblivion, the precious heritage of our cultural roots.

Fig. 50 - The digital edition of the Zamboni essay on Veneto dialects with speech sound examples from the AMDV.

9. Acknowledgments

Thanks to:

• Alberto Zamboni, who dedicated the last days of his life to the AMDV project.
• Michele Loporcaro and Stephan Schmid, who believed in the enterprise to digitize the entire AIS.
• All the AIS Reloaded team (Chiara Zanini, Giulia Donzelli, Stefano Negrinelli), and Giacomo Ferrieri, for their help in preparing the AIS phonetic inventory.
• Alberto Benin (ISTC), who, despite the Covid-19 pandemic, assured the ISTC server configuration for the NavigAIS needs.

Graziano Tisato (AMDV project & direction, database and software implementation); Maria Teresa Vigolo, Paola Barbierato (etymological research); Alberto Zamboni, Laura Vanelli, Enzo Croatto, Vincenzo Galatà, Giovanni Tomasi, John Trumper, Giorgio Vedovelli (linguistic supervision); Vincenza Castellani (AIS lemma transcription); Giacomo Ferrieri (phonetic transcription); Carla Gentili (AIS legend translation); Daniela Perco, Glauco Sanga (ethnographic consultants).
Roberts L. (1963), Machine Perception of Three-Dimensional Solids, Garland Publ., New York
Vithlani P., Kumbharana C.K. (2015), Comparative Study of Character Recognition Tools, International Journal of Computer Applications (0975 – 8887), Vol. 118 – N. 9.
Levenshtein V. (1966), Binary codes capable of correcting deletions, insertions, and reversals, in Soviet Physics Doklady, Vol. 10, 1966, 707–10.
see note 1
The AMDV documents contain: 2 dialectal databases (11650 AIS lemmas, and 14521 AMDV lemmas with IPA and AIS-like transcription); 355 lexico-etymological files (which exploit the patrimony of the G. B. Pellegrini and A. Zamboni school, Padua University); 2300 ethnographic comments and related transcriptions; 580 pictures by P. Scheuermeier and U. Pellis, 647 drawings by P. Boesch et al.; 5870 annotated objects; 355 AIS legends; 71 videos with the interviews to P. Barbierato, C. Gentili, G. Sanga, J. Trumper, and the biographic clips of the 26 AMDV informants, and some videos of ethnographic documentation; 12 paintings on the Veneto material culture by the painter L. Viola; 2 folk songs (Venice and Chioggia).
Ferrero F., Genre A., Boe L.J., Contini M. (1979), Nozioni di fonetica acustica, Omega, Torino.
Zwicker E. (1961), Subdivision of the audible frequency range into critical bands, in The Journal of the Acoustical Society of America, 33.
Moore B., Glasberg B. (1996), A revision of Zwicker's loudness model, in Acta Acustica, vol. 82, 335-345.
Pellom, B. (2001), Sonic: The University of Colorado continuous speech recognizer, in Technical Report TR-CSLR-2001-01, University of Colorado, USA.
REST: Representational State Transfer based communications (Fielding R. - Taylor R. (2002), Principled Design of the Modern Web Architecture (PDF), in ACM Transactions on Internet Technology, Vol. 2, n. 2, New York, Association for Computing Machinery, 115–150).

Bibliography

  • Goebl 1998 = Goebl, Hans (1998): Atlant linguistich dl ladin dolomitich y di dialec vejins, 1a pert/Atlante linguistico del ladino dolomitico e dei dialetti limitrofi, 1a parte/Sprachatlas des Dolomitenladinischen und angrenzender Dialekte, 1. Teil, Wiesbaden, Reichert.
  • Jaberg/Jud 1928 = Jaberg, Karl / Jud, Jakob (1928): Der Sprach- und Sachatlas als Forschungsinstrument. Kritische Grundlegung und Einführung in den Sprach- und Sachatlas Italiens und der Südschweiz, Halle (Saale), Niemeyer.
  • Jaberg/Jud 1928-1940b = Jaberg, Karl / Jud, Jakob (Hrsgg.) (1928-1940): Sprach- und Sachatlas Italiens und der Südschweiz (AIS), vol. 8, Zofingen, Ringier .
  • Jaberg/Jud 1960 = Jaberg, Karl / Jud, Jakob (1960): Index zum Sprach- und Sachatlas Italiens und der Südschweiz, Bern, Stämpfli & Cie.
  • Kattenbusch 1998-2016b = Kattenbusch, Dieter (Hrsg.) (1998-2016): Vivaio Acustico delle Lingue e dei Dialetti d'Italia, Berlin, Humboldt-Universität, Ultimo accesso 12/01/2019 ore 10:34 (Link).
  • Kramer 1991 = Kramer, Johannes (1991): Etymologisches Wörterbuch des Dolomitenladinischen (EWD), vol. IV, Hamburg, Helmut Buske.
  • Krefeld/Lücke 2019 = Krefeld, Thomas / Lücke, Stephan (2019): FAIRNESS – Medien im methodologischen Zentrum der Geolinguistik, in: Berichte aus der digitalen Geolinguistik (II), vol. KIT 9 (Link).
  • Loporcaro 2009 = Loporcaro, Michele (2009): Profilo linguistico dei dialetti italiani, Bari/Roma, Laterza.
  • Loporcaro/Schmid/Pescarini/Tisato/Donzelli/Negrinelli/Zanini 2019 = Loporcaro, Michele / Schmid, Stephan / Pescarini, Diego / Tisato, Graziano / Donzelli, Giulia / Negrinelli, Stefano / Zanini, Chiara (2019): AIS, reloaded (AISr), Zurich, University of Zurich, University of Zurich (Link).
  • Lücke 2019 = Lücke, Stephan (2019): Principi FAIR, in: Metodologia, VerbaAlpina-it 19/1 (Link).
  • Mott/Kezich/Tisato 2003 = Mott, Antonella / Kezich, Giovanni / Tisato, Graziano (2003): Il Trentino dei contadini. Piccolo Atlante sonoro della cultura materiale. Le parole e le cose della ricerca di Paul Scheuermeier (1921/1931) e la voci della tradizione di oggi (1998), Museo degli Usi e Costumi della Gente Trentina, San Michele all’Adige (TN), CDRom - Sistema operativo Windows 95, 98, 98SE, ME, 2000, XP, NT (Link).
  • Negrinelli 2019 = Negrinelli, Stefano (2019): Il progetto AIS reloaded: un archivio sonoro per 36 varietà dialettali della Svizzera meridionale, Arezzo, Proc. XV Conf. AISV 2019, In Print (Link).
  • Pfister/Schweickard 1979- b = Pfister, Max / Schweickard, Wolfgang (1979-): Lessico etimologico italiano, bisher 12 Bände, Wiesbaden, Reichert (Link).
  • Scheuermeier 1943 = Scheuermeier, Paul (1943): Bauernwerk in Italien, der italienischen und rätoromanischen Schweiz, vol. 1-2, Rentsch, Erlenbach-Zürich.
  • Telmon/Canobbio 1985 = Telmon, Tullio / Canobbio, Sabina (Hrsgg.) (1985): Atlante Linguistico ed Etnografico del Piemonte Occidentale - ALEPO, Torino, CELID, Regione Piemonte.
  • Tisato 2010 = Tisato, Graziano (2010): NavigAIS – AIS Digital Atlas and Navigation Software, Naples, Bulzoni ed., Roma, Proc, IX Conf. AISV 2010, 451-461 (Link).
  • Tisato 2015 = Tisato, Graziano (2015): Documenti etnolinguistici navigabili e ‘parlanti’: l’approccio di NavigAIS e dell’Atlante Multimediale dei Dialetti Veneti, vol. Atti del Convegno Intern. di Studi Archivi Etnolinguistici Multimediali, Vol. 39, Torino, F. Cugno, L. Mantovani, M. Rivoira., eds, Bollettino dell'Atlante Linguistico Italiano, Proc. of “Lingue e le Culture della Montagna”, 59-82.
  • Tisato 2019 = Tisato, Graziano (2019): Acquisizione Digitale dell’Intero AIS, Arezzo, Proc. XV Conf. AISV 2019, 131-153, In Print (Link).
  • Tisato/Barbierato/Ferrieri/Gentili/Vigolo 2013 = Tisato, Graziano / Barbierato, Paola / Ferrieri, Giacomo / Gentili, Carla / Vigolo, Maria Teresa (2013): Atlante Multimediale dei Dialetti Veneti, Venezia, Bulzoni ed., Roma, Proc. IX Conf. AISV 2010, 445-462 (Link).
  • Tisato/Vigolo 2016 = Tisato, Graziano / Vigolo, Maria Teresa (2016): Dagli Atlanti storici agli Atlanti multimediali: il NavigAIS e l'AMDV (Atlante Multimediale dei Dialetti Veneti), in: Francesco Avolio e Antonino Cigno ed. - Atti del Convegno Internazionale di Studi - Archivi Etnolinguistici Multimediali, Pescara, Museo delle Genti d’Abruzzo, 96-123 (Link).
  • Wilkinson u.a. 2016 = Wilkinson, Mark D. / Dumontier, Michel / Aalbersberg, IJsbrand Jan / Appleton, Gabrielle / Axton, Myles / Baak, Arie / Blomberg, Niklas / Boiten, Jan-Willem / da Silva Santos, Luiz Bonino / Bourne, Philip E u.a. (2016): The FAIR Guiding Principles for scientific data management and stewardship, in: Scientific data, vol. 3, Nature Publishing Group (Link).
  • Zamboni 1974 = Zamboni, Alberto (1974): Veneto, Pisa, Pacini Editore, Profilo dei dialetti italiani 5.
  • Zamboni 1984 = Zamboni, Alberto (1984): I dialetti cadorini, in: B. Pellegrini e S. Sacco ed., Il ladino bellunese. Atti del Convegno Internazionale (Belluno 2-4 giugno 1983). , Belluno, 45-83.
  • Zamboni/Vigolo = Zamboni, Alberto / Vigolo, Maria Teresa: Tra nomi e cose. Commenti lessicali e onomasiologici allo Scheuermeier veneto, in: Perco D., Sanga G., Vigolo M. T., Ed., Paul Scheuermeier, Il Veneto dei contadini 1921-1932, Vicenza, 67-87.