Standardizing a Standard: Why and how a Best Practice Guide for the DataCite Metadata Schema was created






Citation:
  1. Reference to the entire article:
    Sonja Kümmet & Stephan Lücke & Julian Schulz & Martin Spenger & Tobias Weber (2020): Standardizing a Standard: Why and how a Best Practice Guide for the DataCite Metadata Schema was created, Version 1 (20.01.2020, 13:47). In: Korpus im Text, Serie A, 51272. url: http://www.kit.gwi.uni-muenchen.de/?p=51272&v=1
    This URL contains a reference to the immutable version (…v=nn)
  2. Reference to a section or source of a quotation: http://www.kit.gwi.uni-muenchen.de/?p=51272&v=1#p:1
    This URL contains a reference to a specific section (…p:1). In this case it is linked to the first paragraph. A complete reference can be obtained via the citation symbol at the beginning of each paragraph.
Abstract

In order to promote the FAIRness of research data, the use of a widespread metadata schema is recommended to describe the data. The DataCite Metadata Schema published by the consortium of the same name has meanwhile established itself as a model used worldwide. However, the evaluation of DataCite XML files created by project managers at the IT Group Humanities of the LMU Munich and at the Leibniz Supercomputing Centre revealed the need to extend the standard. Against this background, representatives of data creators, data curators and data aggregators participated in the development of a best practice guide for DataCite in order to increase the interoperability of (meta-)data through a stronger standardization. This paper describes the development process towards the now published Best Practice Guide, discusses the reasons for its development, and presents the main features of the guide and the potential of its future application.

1. Introduction

The DataCite Metadata Schema has become a widely used standard for describing research data. DataCite is used in a variety of research data infrastructures and repositories.1In addition to the Metadata Search provided by DataCite, for example platforms like Zenodo, GeRDI and the research data repository Open Data LMU provided by the Ludwig-Maximilians-Universität München. To use a common metadata schema is not sufficient to achieve interoperability: Missing guidelines for the creation of metadata lead to ambiguities and thus to difficulties in aggregation and automated processing of research data – all to the disadvantage of academic end users.2Brase u.a. 2015, 4.

To ensure that data are Findable, Accessible, Interoperable and Reusable in accordance with the FAIR principles,3Wilkinson u.a. 2016. the structured and standardised collection of their metadata plays a decisive role.4Brase u.a. 2015. The Best Practice Guide for the DataCite Metadata Schema presented here aims to achieve this:5And thus follows the recommendations of the position paper of the DHd Working Group Data Centres (DHd AG Datenzentren 2018, 25f.). It supports scientists in describing their research data. Operators of data repositories can use it as a stimulus for a more deliberate approach to metadata integration. Last but not least, we hope that this publication contributes to a more FAIR approach to research data and raises the research community's awareness of the importance of a streamlined usage of metadata, especially with regard to interoperability.

This article first refers to current developments with regard to DataCite (section 2). Subsequently (section 3), different types of metadata schemas and their added value for research are presented. Section 4 focuses on the DataCite Metadata Schema, for which the Best Practice Guide was designed. The initial situation from which the development of the guide started (section 5) forms the basis for the discussion of the reasons for this undertaking (section 6). The two subsequent sections (7 and 8) present the development process, including the participants involved, and the design of the guide, respectively.

2. DataCite in the Context of Current Developments

Since its establishment in 2009, the international DataCite consortium has played a major role in promoting the cause of research data management in the sciences and humanities. In particular, the publication of the eponymous DataCite Metadata Schema (DataCite Metadata Working Group 2019) and the recommendations on its use, for example regarding persistent identifiers for research data (Rueda u.a. 2016), illustrate the importance of DataCite. The webinar series presenting the metadata schema reaches out to repository operators and researchers alike (Rueda 2016). The positive effect of the cooperation of researchers, scientists and librarians is often emphasized (Müller 2019), as can be seen from a review of five years of DataCite and ten years of Digital Object Identifier (DOI) (Brase u.a. 2015).

DataCite is used worldwide as a common standard. The metadata scheme is now internationally recognized, e.g. in Japan (Takeda 2015) and Korea (Kim u.a. 2017), or at the University of Michigan with its own task force (Álvarez u.a. 2013). National consortia have also been established, such as the DataCite Estonia Consortium, DataCite Netherlands or DataCite UK. In addition to taking regional aspects into account when awarding DOIs, these consortia are also working towards standardizing the metadata used. The community in German-speaking countries is also increasingly relying on DataCite, as evidenced by reports and activities by da|ra (Helbig u.a. 2015) and ETH Zurich (Hirschmann 2015). DataCite is also seen as an important building block for research data management (RDM) in scientific libraries (Pletsch u.a. 2018).

The DataCite Metadata Schema (DataCite Metadata Working Group 2019) serves as the basis for this Best Practice Guide; other important sources are the CrossRef documentation (CrossRef 2019) and the DOI Handbook (International DOI Foundation 2017). The extensive documentation of the DataCite Metadata Schema comprises more than 70 pages and provides detailed instructions for completing the metadata fields. However, this does not guarantee that the resulting metadata are interoperable. The broad variety of international DataCite reports (cited above) was also an important source for the development of the guide. Of central importance is the fact that the metadata schema is constantly evolving (Starr/Gastl 2011). The Metadata Working Group and the Policy and Best Practices Working Group (the latter is currently inactive) enabled a multitude of developments in the metadata schema. The Best Practice Guide presented here ties in with this practice and is primarily addressed to data producers, but also to infrastructure providers.

The PID Forum has established itself as an exchange platform for topics relating to persistent identifiers (PID). It not only offers reports on experiences and concrete case studies, but also promotes the networking of institutions in this field. The forum was also a useful source for the development of the Best Practice Guide; the use of established PID systems for the reliable linking of distributed data sets is also recommended in the presented Best Practice Guide.

3. Importance of metadata for research

In order to publish research data, there are subject-specific and interdisciplinary offers for researchers:6For the various types of repositories, consult DHd AG Datenzentren 2018, 20-23. Subject repositories offer the possibility to make data of a certain subject domain accessible to the community (e.g. GESIS for the Social Sciences, OstData for Eastern, Eastern-Central and South-Eastern European Studies). Interdisciplinary repositories are subdivided into site-specific offerings, for example for members of a university (e.g. Open Data LMU) and research data infrastructures across different locations. „Research data infrastructure“ as a term covers both services for the independent publication of research data (e.g. Zenodo, DARIAH-DE Repositorium7Currently available in an advanced beta version.) and meta-search engines that operate beyond the boundaries of repositories (e.g. Generic Research Data Infrastructure (GeRDI)8Bode u.a. (2017-); cf. Grunzke u.a. 2017.).

Despite – or perhaps because of – this diversity of possibilities for the publication of research data, the following applies to all services: The use of metadata is indispensable to ensure that research data can be used sustainably, i.e. over a long period of time, by a broad (scientific) public and to make it accessible for automated processing by technical systems. Due to their abstraction, metadata enable research data to be found, referenced and re-used independently of their specific structure and physical storage location. As machine-readable descriptions of data, they play an important role in the indexing of data sets and contribute significantly to increasing the efficiency of research.9Grunzke 2016, 2.

Metadata support the selective retrieval of relevant data sets for current or future research projects. In some cases this can lead to new research questions.10Franzke 2017, 5. Information on software, methods, and models used to create the data can be a helpful facet to search by. Last but not least, a reference to the underlying licensing model in the metadata contributes to legal certainty in science and beyond. In order to develop the potential for research in the best possible way, the consistency of metadata plays a decisive role, as does the choice of a widely applied metadata standard.11Cf. Note 1. Only in this way can heterogeneous resources be linked with each other. Semantic web technologies and linked open data procedures then facilitate innovative usage scenarios.12Pohl/Danowski, 392-408, esp. 393f.

Metadata can also help to integrate data into conventional library catalogues (OPAC). This leads to an increased visibility of scientific results by a wider community and thus – in line with the idea of open access – to a democratization of knowledge.

Out of the different types of metadata,13Cf. Rühle 2012b. the following section will focus on the descriptive metadata. These can be subdivided into descriptive metadata and reference metadata:

  • Descriptive metadata are used for the formal indexing of research data. They contain, for example, the title, information on the authorship of the data, the context in which it was created, and the institutional relations of the underlying research project. Well-known standards are the Dublin Core format, provided since 1994 by the Dublin Core Metadata Initiative (DCMI), and the more comprehensive DataCite Metadata Schema.
  • Reference metadata contain information obtained by analyzing the relationships within the resource itself. For example, reference metadata make it possible to model the actors, objects and relations involved in the creation of an ancient site. Through comprehensive indexing of content, such subject-specific information becomes searchable. A well-known and powerful ontology for such concepts of cultural heritage is CIDOC CRM, including many extensions for certain sub-disciplines.14 For a list of extensions compatible with CIDOC CRM see: http://www.cidoc-crm.org/collaborations. The Europeana Data Model (EDM) has also established itself as a standard for the content indexing of cultural data.

For the best possible indexing of research data, the combination of descriptive and reference metadata models is recommended. This combination serves as a single entry point for searches by different target groups.15Gradl u.a. 2015, especially section 2.

This paper focuses on the DataCite Metadata Schema and thus on the generic component of metadata indexing. As a granular metadata schema, DataCite already has some characteristics similar to reference metadata. This, of course, does not make the additional use of content-indexing models obsolete. However, the careful description of research data in DataCite allows repositories that do not (yet) support subject-cataloguing models to provide at least basic information about the content of the data. In addition, even a basic form of subject cataloguing can make members of other scientific domains aware of data sets whose relevance would not have been evident without information on the content.16Pempe 2012, 138.

The DataCite Metadata Schema offers good cross-linking options, especially if there is no existing in-depth content indexing of the data available. In this way, it promotes an interdisciplinary contextualization of research results. DataCite offers further advantages over other generic metadata models, which are briefly explained in the following section.

4. The DataCite Metadata Schema

Providing the DataCite Metadata Schema is one of the core tasks of DataCite and the Metadata Working Group. The development of the metadata schema to date can be easily traced through the DataCite website: Version 2.0 of the metadata schema from January 2011 was replaced by version 3.0 in July 2013. Currently, all versions from 3.0 onward are supported. Version 4.3 is the latest version and also used in this Best Practice Guide. The Metadata Working Group is currently working on version 5.0. After its release, this Best Practice Guide will be adapted accordingly.

A key factor of DataCite services is the concept of the persistent identifier. DOIs allow a unique and URL-independent assignment of a digital resource and are used to identify a resource described by DataCite metadata. Besides CrossRef, DataCite is one of the most widely used institutions for the registration of DOIs. The role that DOI registration plays in the dissemination of the standard is illustrated by the number of registered metadata records: In autumn 2019, there were a total of more than 19 million registrations via DataCite17Cf. the statement in „completeListSize“ at: https://oai.datacite.org/oai?verb=ListIdentifiers&metadataPrefix=oai_dc, which enable users, repository operators and data aggregators to ensure the citation of scholarly output on a permanent basis.
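
The figure cited above is read from the `completeListSize` attribute that an OAI-PMH `ListIdentifiers` response may carry on its `resumptionToken` element. The following is a minimal sketch of how such a value could be extracted from a response document. It parses a shortened, made-up response instead of calling the live endpoint; the identifier, the token and the number 12345 are placeholders and not real DataCite values, and `complete_list_size` is a helper name chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Shortened, made-up response. The count is a placeholder, not a real figure.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:10.1234/example</identifier>
      <datestamp>2019-10-01T00:00:00Z</datestamp>
    </header>
    <resumptionToken completeListSize="12345" cursor="0">hypothetical-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def complete_list_size(response_xml):
    """Return completeListSize as an int, or None if the response omits it."""
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None:
        return None
    size = token.get("completeListSize")
    return int(size) if size is not None else None


print(complete_list_size(SAMPLE_RESPONSE))
```

Since `completeListSize` is an optional attribute in OAI-PMH, the helper returns `None` rather than failing when a repository does not report it.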

There are several large registration agencies in Germany that offer DOI services. In addition, institutions have the possibility of a direct membership in DataCite, which includes an independent administration of the DOI allocation and enables direct cooperation in the DataCite community.18Out of the institutions contributing to the Best Practice Guide, the University Library of the LMU has been an independent member of DataCite since July 2019 and registers DOIs under the prefix 10.5282. Previously, it used the DOI Service of the Technische Informationsbibliothek Hannover (TIB) for several years.

DOIs can be registered either via the DOI Fabrica registration form19For information on DOI Fabrica see DataCite Roadmap: https://datacite.org/roadmap.html. However, not all metadata fields are considered here (see Ticket: „Support all metadata fields in the DOI registration form“). or via an application programming interface. The basic prerequisite for this is compliance with the schema structure and consistent use of the fields. DataCite divides the latter into the following categories:20See also section „DataCite Best Practice Guide“.

  • Mandatory
  • Recommended
  • Optional
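
To illustrate the first of these categories, the following sketch checks whether a DataCite XML record contains the six properties that schema version 4.3 marks as mandatory (Identifier, Creator, Title, Publisher, PublicationYear, ResourceType). It is a simplified presence check under stated assumptions, not a replacement for validation against the official XSD; the sample record and the helper name `missing_mandatory` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of DataCite kernel-4 records.
NS = {"dc4": "http://datacite.org/schema/kernel-4"}

# Mandatory properties of schema 4.3 and the element that carries each one.
MANDATORY_PATHS = {
    "Identifier": "dc4:identifier",
    "Creator": "dc4:creators/dc4:creator/dc4:creatorName",
    "Title": "dc4:titles/dc4:title",
    "Publisher": "dc4:publisher",
    "PublicationYear": "dc4:publicationYear",
    "ResourceType": "dc4:resourceType",
}

# Invented sample record that deliberately lacks a publisher.
RECORD = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <creators><creator><creatorName>Doe, Jane</creatorName></creator></creators>
  <titles><title>Example data set</title></titles>
  <publicationYear>2020</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">Database</resourceType>
</resource>"""


def missing_mandatory(record_xml):
    """Return the names of mandatory properties that are absent or empty."""
    root = ET.fromstring(record_xml)
    missing = []
    for name, path in MANDATORY_PATHS.items():
        element = root.find(path, NS)
        if name == "ResourceType":
            # The controlled value lives in the resourceTypeGeneral attribute.
            ok = element is not None and element.get("resourceTypeGeneral")
        else:
            ok = element is not None and (element.text or "").strip()
        if not ok:
            missing.append(name)
    return missing


print(missing_mandatory(RECORD))
```

A check of this kind only shows that a field is filled in. It says nothing about whether the content is filled in consistently, which is the gap the Best Practice Guide addresses.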

DataCite pursues the following objectives with the metadata scheme:21Cf. 1.2 in https://schema.datacite.org/archive/kernel-2.2/doc/DataCite-MetadataKernel_v2.2.pdf

  • Recommending a standard citation format for datasets, based on a small number of properties required for identifier registration;
  • providing the basis for interoperability with other data management schemas;
  • promoting dataset discovery with optional properties allowing for flexible description of the resource, including its relationship to other resources;
  • providing the basis for future services (e.g., discovery) through the use of controlled terms from both a DataCite vocabulary and external vocabularies.

Another reason for using DataCite is the on-going development and improvement of the standard by the community. The institutions involved in this Best Practice Guide also contribute to the discussion. This takes place on the one hand through the continuous exchange with researchers and infrastructure providers, on the other hand through the PID Forum and in direct contact with the DataCite-consortium.

Another strength of DataCite is the continuous improvement of interoperability between different models. While CrossRef continues to be used in the publishing world, DataCite is mainly used for research data and institutional repositories. Interfaces like the CrossRef API and the DataCite API offer the possibility to exchange metadata and provide information on citation rules and recommendations of data publication.22See https://www.crossref.org/blog/data-citation-what-and-how-for-publishers/ There are also mappings to standards like Dublin Core, IDF, OECD and DDI.23See 21-32: http://schema.datacite.org/meta/kernel-2.2/doc/DataCite-MetadataKernel_v2.2_de.pdf

Since the number of data sets produced in the scientific field is constantly growing, communication and networking are also important components of research data management. The visibility of the data is optimized by complying with the specifications of a metadata schema, using DOIs and the „universal integration and comprehensive use“24Dreyer u.a. 2019, 120. of additional PIDs. By promoting the representation of citation networks and measuring the output of institutions, the efficiency of science communication is also increased.25For example, the integration of the ROR identifier in DataCite 4.3 makes it possible to search in DataCite Search for the research results of a particular institution. See the blog post: Dasler 2019.

5. Initial Situation

The starting point for considering the development of a Best Practice Guide was the indexing of different data sets from various research projects using generic metadata. The corresponding necessity arose almost simultaneously both at the IT Group Humanities (ITG) of the LMU and in the context of High Performance Computing (HPC) at the LRZ. While the ITG handles research data from the humanities,26The ITG has collected a considerable amount of research data over the past decades. The first DH project at the ITG is the Biblia Hebraica transcripta (BHt) (Richter/Rechenmacher/Riepl 1986-), whose origins date back to the 1980s. The projects VerbaAlpina (Krefeld/Lücke 2014-), a lexically oriented project dealing with the linguistic and cultural area of the Alps, and the personnel database „Kaiser und Höfe“ (Imperial Courts and Courtiers) (Hengerer/Schön 2014-), which contains the courtiers of the Austrian Habsburgs of the 16th and 17th centuries, are of a more recent date. The data sets administered by the ITG originate from a large number of individual disciplines and subjects in the Humanities; in addition to Linguistics and History, these include above all Egyptology, Dramatics, Art History, Musicology, Theology and Archaeology. the data at the LRZ are produced by fields like environmental sciences, life sciences and astrophysics. At the ITG as well as at the LRZ, the question increasingly arose of how large amounts of research data could not only be made available in the long term, but also be generically indexed and thus made searchable and re-usable.

In 2016, the DFG-funded GeRDI project was launched with the participation of the LRZ as a project partner. The aim of the GeRDI project is to integrate research data transparently and across disciplines. The following year, in 2017, the ITG and the LMU’s University Library were able to acquire relevant expertise with their participation in the RDM project „eHumanities – interdisciplinary“27Söllner/Riepl/Weiß 2018-., funded by the Bavarian State Ministry of Science, Research and the Arts. Both projects selected VerbaAlpina as a pilot project: As a central use case, the structured research data from VerbaAlpina are to be enriched with generic metadata in close cooperation with the project managers and LMU’s University Library and transferred to LMU’s data repository. In the course of this use case, a clear distribution of roles emerged in which the ITG has the function of a discipline-oriented competence centre, the University Library (UB) that of a data centre and GeRDI that of an interdisciplinary data aggregator.

As a result of this collaboration, it turned out that the DataCite Metadata Schema represents a suitable basis for the operational realisation of the project.28For the evaluation of DataCite, see the section „The DataCite Metadata Schema“ In order to cover the broadest possible range of scenarios, other projects based at the ITG that are as complementary as possible (from a technical point of view) were selected as further examples in addition to the linguistically oriented VerbaAlpina project. At the LRZ, metadata enrichment was carried out by way of example in the context of the ClimEx project.29Ludwig 2015-2019; cf. Leduc u.a. 2019. The responsible project managers were asked to enrich their research data with DataCite metadata. A tool for collecting metadata, the DataCite generator, was provided for the use cases of the ITG. The functionality of the tool was developed further in conjunction with the Best Practice Guide.30For detailed adjustments see Section 8.

The evaluation of the DataCite XML files submitted by the project managers revealed that, despite the use of a uniform metadata schema, too little standardization existed. The identified need to specify and extend the standard marked the beginning of the development process, which lasted several months and culminated in the publication of the DataCite Best Practice Guide discussed here.

6. The Rationale for a Best Practice Guide

The creation of the Best Practice Guide is doubly motivated: On the one hand, the aim was to create a toolbox to support researchers in using DataCite to describe their digital output and, on the other hand, to streamline the inputs in order to improve the quality of metadata and thus promote the subsequent use of (meta-)data.31Cf. section 2.

The DataCite Best Practice Guide is aimed primarily at researchers who want to upload their research data into a repository and provide it with DataCite-compliant metadata for this purpose. Researchers typically strive to reduce the time necessary to publish the data, especially at the end of a project. Furthermore, they often have little or no experience in handling metadata. The Best Practice Guide is not only shorter than the official DataCite documentation, but is also easily accessible and adapted to the target group. The aim is high-quality metadata; an efficient involvement of the data producer in the development process brings advantages for all stakeholders: The quality of the data published in the repository improves if the expertise of the researchers, which was already essential for the methodological production of the data itself, is incorporated into the production of the metadata. In turn, a specific description of the data makes it easier for repository users to search for data and decide on their relevance; ultimately, the data producer also benefits from a stronger impact of the described resource.

In the longer term, the Best Practice Guide should also help to raise awareness of metadata quality among the researchers and the influence of this quality on the retrievability and reusability of data. This includes not least a more sensitive handling of data, starting with the modelling of research data and ending with their description in the form of metadata.

Improving the quality of metadata is the overarching concern of this guideline. Standardization of metadata can lead to increased visibility, especially against the background of machine-supported evaluation methods (Weber/Kranzlmüller 2018a). The review of the DataCite metadata generated by researchers has shown that the framework defined by the DataCite schema is wider than initially assumed: although all the files provided by the researchers are DataCite-compliant, their contents are sometimes very heterogeneous, which makes the interoperability of the metadata a challenge for machines. An example: The strings or literals „LMU“ and „Ludwig-Maximilians-Universität München“ designate the same entity, but cannot easily be detected as identical by a machine.
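
The problem can be made concrete with a small sketch. It compares two affiliation entries, first as free-text literals and then with a shared organization identifier of the kind that DataCite 4.3 allows in the `affiliationIdentifier` attribute. The dictionary layout, the function name `same_organisation` and the identifier value are assumptions made for this illustration; the identifier is a made-up placeholder, not the actual ROR ID of any institution.

```python
# Made-up placeholder, NOT a real ROR identifier.
PLACEHOLDER_ROR = "https://ror.org/00example0"

# Two entries that rely on free text only.
free_text_a = {"name": "LMU"}
free_text_b = {"name": "Ludwig-Maximilians-Universität München"}

# The same two entries, each enriched with a shared identifier.
identified_a = {"name": "LMU", "identifier": PLACEHOLDER_ROR}
identified_b = {
    "name": "Ludwig-Maximilians-Universität München",
    "identifier": PLACEHOLDER_ROR,
}


def same_organisation(a, b):
    """Compare by identifier when both entries have one, else by exact name."""
    id_a, id_b = a.get("identifier"), b.get("identifier")
    if id_a and id_b:
        return id_a == id_b
    return a["name"] == b["name"]


print(same_organisation(free_text_a, free_text_b))    # literals differ
print(same_organisation(identified_a, identified_b))  # identifiers match
```

With literals alone, a machine would need fuzzy matching or a lookup table to recognize the two names as one institution. With a globally unique identifier, a simple equality test is enough.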

To remedy this, the Best Practice Guide tightens the framework given by the schema and makes some of the original specifications more concrete. This is particularly relevant for fields that are otherwise not standardized, especially free text fields. By providing a list of potential input options in such cases, the Best Practice Guide not only makes input easier for the user, but also ensures optimal reusability of the entered information.32This does not imply a complete renunciation of the possibility of entering free text. This would also not be desired by the scientific community. Cf. the results in: Zhang u.a. 2015. A further restriction consists in limiting fields and attributes to a smaller set of options than the one provided by the DataCite standard. This smaller selection is further illustrated by canonical examples in order to achieve a uniform use.
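
The idea of narrowing a controlled list can be sketched as a two-stage check. The full list below contains the `resourceTypeGeneral` values of DataCite 4.3; the narrower `GUIDE_SUBSET` is purely hypothetical and does not reproduce the selection actually made in the Best Practice Guide. The function name and the three result labels are likewise assumptions for this illustration.

```python
# resourceTypeGeneral values of DataCite Metadata Schema 4.3.
DATACITE_RESOURCE_TYPES = {
    "Audiovisual", "Collection", "DataPaper", "Dataset", "Event", "Image",
    "InteractiveResource", "Model", "PhysicalObject", "Service", "Software",
    "Sound", "Text", "Workflow", "Other",
}

# Hypothetical narrower selection, NOT the subset defined in the actual guide.
GUIDE_SUBSET = {"Collection", "Dataset", "Software", "Text"}


def check_resource_type(value):
    """Classify a value against the standard and the narrower selection."""
    if value not in DATACITE_RESOURCE_TYPES:
        return "invalid"
    if value not in GUIDE_SUBSET:
        return "valid, but outside the recommended subset"
    return "ok"


for candidate in ("Dataset", "Sound", "Spreadsheet"):
    print(candidate, "->", check_resource_type(candidate))
```

Every value accepted by the narrower check remains valid DataCite, so records produced this way stay compatible with the standard while becoming more uniform.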

Better, i.e. more homogeneous metadata not only make it easier for repository and search engine operators to aggregate the data, they also significantly improve information retrieval: Users searching for research data from a specific subject area expect that their query returns only results relevant in this context. This presupposes that the information about the subject classification of a research data item is not only contained in the relevant metadata, but can also be clearly identified as such by a machine. The subject field of the DataCite schema in particular is essential for content-based searches; it should therefore not be filled with arbitrary keywords, but with keywords from standardized vocabularies, such as the Integrated Authority File (GND) and Wikidata, and with the Dewey Decimal Classification (DDC) scheme. Such a standardization impacts the recall and precision of search queries and is also important for the interoperability of the data. The Best Practice Guide promotes the use of globally identified reference data: in addition to references to particularly relevant vocabularies for different entities (persons, organizations, funding agencies etc.), it also contains tips and tools for their use.
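
As an illustration, a subject entry that refers to a standardized vocabulary could be built as follows. The sketch uses the `subjectScheme`, `schemeURI` and `valueURI` attributes that the DataCite schema defines for the subject property. The helper `make_subject` is a name chosen for this example, and the scheme label and URI shown for the DDC are illustrative values rather than values prescribed by the Best Practice Guide.

```python
import xml.etree.ElementTree as ET


def make_subject(term, scheme, scheme_uri, value_uri=None):
    """Build a DataCite-style <subject> element tied to a vocabulary."""
    subject = ET.Element("subject")
    subject.set("subjectScheme", scheme)
    subject.set("schemeURI", scheme_uri)
    if value_uri:
        # Optional link to the individual term within the vocabulary.
        subject.set("valueURI", value_uri)
    subject.text = term
    return subject


# Illustrative DDC entry: main class 400 (Language).
ddc_subject = make_subject(
    term="400 Language",
    scheme="dewey",
    scheme_uri="http://dewey.info/",
)

print(ET.tostring(ddc_subject, encoding="unicode"))
```

Because the scheme is named explicitly, an aggregator can tell a DDC notation apart from a free keyword and group records by classification instead of by string similarity.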

7. Development of the Best Practice Guide

The inclusion of all relevant perspectives is essential for the quality of a best practice guide. These perspectives have played a major role in the creation of the guide; the involved institutions contributed to its creation by representing the different roles in the flow of research data from generation, processing, and curation of research data to the aggregation of their metadata by cross-repository services. The following perspectives were represented:

  • The research project VerbaAlpina represents the perspective of data producers. VerbaAlpina is a long-term project funded by the German Research Foundation (DFG). Since 2014, it has been run as an interdisciplinary project jointly by the Romance Linguistics Department of the LMU and the ITG. VerbaAlpina focuses on language and culture in the Alpine region. At the heart of the project is a database that documents the georeferenced designations of alpine-specific concepts, for example from the field of alpine pasture management. In cooperation with the university library of the LMU, the enrichment of the database with descriptive and reference metadata is currently being designed and implemented. In the medium term, the goal is to permanently store the data in the data repository of the LMU at the finest granularity and thus make them Findable, Accessible, Interoperable and Reusable (FAIR). According to the project staff, this implies a fine-grained approach to the research data.33 The authors follow the discussion on how granularity within research data should be described and plan to comment on this topic in a separate publication. Specifically, not only the overall project, but also the central data categories of VerbaAlpina (morpho-lexical type, community of origin, concept, single reference)34Cf. Krefeld/Lücke 2018c and Krefeld 2018ae. will be indexed with DataCite and identified via DOI.
  • The IT Group Humanities (ITG) of the LMU represents the perspective of data consultants. The institution has existed at the LMU since 2000 and is jointly supported by the six humanities faculties. Initially, the planning, realisation, and support of the IT infrastructure, as well as user support, constituted the main part of its activities, while supporting scientists in carrying out research projects with IT components was rather marginal. The importance of the latter has increased significantly in the years since the establishment of the ITG. The emergence of the label „Digital Humanities“ (DH)35Cf. fundamental: Jannidis u.a. 2017c. can be seen as an indicator of the importance of digital methods in the humanities as well. It corresponds to the multiplication of research projects accompanied by the ITG. The role of the ITG in the implementation of DH projects ranges from advising on the grant application process, to the provision and maintenance of the necessary IT infrastructure, to the development and implementation of concepts for the sustainable availability of the research data collected in a DH project. Most of the data stocks at the ITG are stored in structured form in MySQL databases on fail-safe clusters.
  • The University Library of the LMU represents both data publishers and data consultants. Its institutional repository Open Data LMU has been in operation since 2011. The repository allows members of the LMU to publish and archive their research data via self-upload. In addition, the publication services Open Access LMU and Open Journals LMU are offered, which also support DOI assignment for uploaded contents. The existing services are constantly being developed further: the library plans to offer a new system for the provision of research data, based on Fedora Commons and Project Blacklight, which will complement the ITG services with sustainable data management and long-term availability. One core competence of the library is the development and dissemination of (meta-)data; among other providers, the library disseminates its data catalogue to GeRDI (see below).
  • Both institutions (library and ITG) cooperate within the framework of the project „eHumanities – interdisziplinär“, which, under the leadership of the University Library of the Friedrich-Alexander-University Erlangen-Nuremberg (FAU), aims to evaluate and develop new technical solutions and services in the field of research data management.36The following study provides insights into the cooperation between researchers in the field of „Digital Humanities“ and research libraries: Wagner Webster 2019. The project is divided into five work packages (WP), of which WP 1 (metadata) plays an important role in the present context, while WP 2 (data management plans) and WP 4 (establishment of services) also play a minor role. ITG and UB LMU jointly form a centre for research data management37As defined in the position paper of the Working Group Data Centres DHd AG Datenzentren 2018, 20f. (with the library as a data centre and ITG as a domain-specific competence centre). The centre advises scientists on the realisation of DH projects and, within the framework of a standardised workflow, also pays attention to compliance with metadata standards, which ultimately takes into account essential aspects of the FAIR principles.
    In the medium term, an essential part of the research data collected at the ITG over the years will be described by metadata retroactively, thus improving their findability, usability and interoperability. The extent to which such a backward-looking action can actually be carried out will become apparent in the course of the project and of course depends crucially on the available human resources. For the future, however, standard procedures for the enrichment with metadata and their subsequent administration are being developed within the framework of the project, which can be integrated into the workflow as a mandatory part of the realisation of future projects.
  • The Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of the Sciences and Humanities represents the perspective of the data infrastructure providers. The research data service portfolio is primarily aimed at users of HPC systems (as for example the SuperMUC-NG or the linux cluster) and currently concentrates on classical data management, i.e. the efficient connection, storage and transfer of data. With the LTDS architecture38Götz u.a. 2019. a dissemination layer will be added. Outside the HPC area the LRZ offers low-threshold services, such as  Gitlab and Sync&Share, which can be used for the exchange of research data and software.
  • The research project Generic Research Data Infrastructure (GeRDI) represented the perspective of (meta-)data aggregators in the development of the guide. The aim of the project is to support researchers in Germany in data integration tasks, especially in areas that are considered lie inside the „long tail“ of the sciences, i.e. those fields whose data volume is medium to small compared to classical data-intensive sciences such as astrophysics. The main output of the project consists of the organizational and technical building blocks for offering services that link data repositories with each other.

The Best Practice Guide was developed in a continuous exchange over several months and follows the structure of the official DataCite standard. The discussion drew on the requirements of three specific data projects:

  • VerbaAlpina (Romance studies)
  • Digital Encyclopedia of Bavarian Musicians (musicology) [39: Focht 2004-.]
  • ClimEx (meteorology and hydrology)

With this selection, the perspectives of different disciplines of research could be taken into account. These projects are examples of a far larger number of projects at the ITG and the LRZ, whose requirements also influenced the creation of the Best Practice Guide.

Developments such as the update of the DataCite standard from version 4.2 to 4.3 and the growing importance of authority-data services such as ROR or ORCID [40: Cf. Haak u.a. 2012.] were successively incorporated into the discussions. The results were then presented and discussed in research data working groups (rdmuc – Munich Working Group for Research Data), projects (“eHumanities – interdisziplinär”) and expert associations (DataCite Consortium, RDA).

Alongside the development of the Best Practice Guide, adjustments were made to the DataCite generator of the ITG. In a first step, this metadata generation tool, which is available under a free license, was updated from version 4.0 to 4.3. Subsequently, the focus was on adapting its functionality and structure to the requirements of the Best Practice Guide. To help users fill in the individual fields, each field links directly to the corresponding section of the guide. In the course of this work, the clarity and user-friendliness of the tool were also improved. For example, it is now possible to import existing DataCite XML files into the generator for updates. In parallel with the DataCite generator, the software architecture “Let The Data Sing” for the FAIR publication of large data sets is being developed at the LRZ. Its primary use case is the automated generation and dissemination of metadata based on large amounts of data. [41: Götz u.a. 2019.] This architecture is also based on DataCite and focuses on automated processes.

After the design of the Best Practice Guide and the adaptation of the DataCite generator, a first pre-test was carried out by the actors involved in the elaboration on the basis of the above-mentioned projects. Subsequently, researchers were asked to describe their data collections again in the generator using the guide. Now that this comprehensive evaluation process has been completed, both resources can be made available to the scientific community for the first time.

8. Main features of the DataCite Best Practice Guide

The guide is a clearly structured, concise working tool that is also backed up by a series of practical examples. It is divided into three parts:

  1. General basics, written in FAQ style
  2. Detailed notes on all 19 DataCite fields, including short examples from practice
  3. References to four complete examples of metadata; the background of the projects (see section above) is briefly introduced. Two of the examples are related to the VerbaAlpina pilot project, in order to provide an example of the granular labelling of research data (here: overall project, single reference) outlined above, and to show the potential of this form of indexing.

The Best Practice Guide is a restriction of the specifications documented in the DataCite standard: the guide prescribes stricter handling of optional or recommended fields/attributes and specifies conventions where the DataCite standard allows ambiguous input. Because it only narrows the standard, metadata created or modified according to the guide remains DataCite-compliant.

In addition to the six fields that are defined as mandatory in the DataCite standard itself, the guide also prescribes the following three fields:

  • Subject: Qualified references to central concepts (keywords), a classification of the discipline(s) of research of the resource, and the specification of the location (if applicable) increase the findability of the data set and can be quickly specified with the sources suggested in the guide. It is important to use controlled vocabularies or ontologies (e.g. Wikidata, GND or GeoNames).
  • Description: Data sets that are concisely described in the style of an abstract have a higher probability of being found by interested researchers. Therefore, according to the Best Practice Guide, a descriptive free text with a maximum of 300 words is mandatory (at least in English, optionally also in other languages).
  • Rights: The indication of a license is absolutely necessary in order to create clarity for actors who are interested in re-using the research data. If the chosen license is non-free, i.e. restricts re-use in a way that necessitates interaction with the rights holder (e.g. to ask permission for commercial re-use), a contributor of type “RightsHolder” is also mandatory.
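The nine required fields (six from the DataCite standard, three added by the guide) lend themselves to a mechanical check. The following is a minimal sketch and not part of the guide or of the ITG generator: it assumes a DataCite 4.x XML record in the kernel-4 namespace, and the function name `check_required_fields`, the sample record and its DOI are hypothetical. The RightsHolder rule is deliberately left out, because whether a license is non-free cannot be read from the record alone.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-4"}

# Six fields mandatory in the DataCite standard, plus the three the guide adds.
REQUIRED_PATHS = {
    "Identifier": "d:identifier",
    "Creator": "d:creators/d:creator",
    "Title": "d:titles/d:title",
    "Publisher": "d:publisher",
    "PublicationYear": "d:publicationYear",
    "ResourceType": "d:resourceType",
    "Subject": "d:subjects/d:subject",
    "Description": "d:descriptions/d:description",
    "Rights": "d:rightsList/d:rights",
}
MAX_ABSTRACT_WORDS = 300  # word limit stated in the guide


def check_required_fields(xml_text):
    """Return a list of problems; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    problems = [
        f"missing field: {name}"
        for name, path in REQUIRED_PATHS.items()
        if root.find(path, NS) is None
    ]
    # Count whitespace-separated words in every abstract-type description.
    for desc in root.findall("d:descriptions/d:description", NS):
        if desc.get("descriptionType") == "Abstract":
            words = len((desc.text or "").split())
            if words > MAX_ABSTRACT_WORDS:
                problems.append(
                    f"abstract has {words} words (limit {MAX_ABSTRACT_WORDS})"
                )
    return problems


# Hypothetical record: it satisfies the DataCite standard but lacks
# the Subject and Rights fields that the guide additionally requires.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.1234/example</identifier>
  <creators><creator><creatorName>Doe, Jane</creatorName></creator></creators>
  <titles><title xml:lang="en">Example data set</title></titles>
  <publisher>Example Repository</publisher>
  <publicationYear>2020</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
  <descriptions>
    <description xml:lang="en" descriptionType="Abstract">A short abstract.</description>
  </descriptions>
</resource>"""

print(check_required_fields(SAMPLE))
```

For the sample record, the function reports the two fields that only the guide makes mandatory (Subject and Rights); a record that merely satisfies the DataCite standard can therefore still fail the stricter check.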

The Best Practice Guide also restricts attributes of optional or mandatory fields. This applies in particular to the RelatedIdentifier field: the values that the standard permits for the relationType attribute are reduced to a subset of options, each of which is contextualized by a canonical application example.
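A restriction of this kind can be enforced with a simple membership test. The sketch below assumes a DataCite 4.x XML record; the set `ALLOWED_RELATION_TYPES` is a placeholder made up of genuine DataCite relationType values, but it is not the selection made in the guide, which is not reproduced here. The function name and the sample identifiers are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-4"}

# PLACEHOLDER subset: valid DataCite relationType values, but NOT the
# guide's actual selection. Replace with the subset listed in the guide.
ALLOWED_RELATION_TYPES = {
    "IsPartOf",
    "HasPart",
    "IsNewVersionOf",
    "IsPreviousVersionOf",
}


def disallowed_relations(xml_text, allowed=ALLOWED_RELATION_TYPES):
    """Return (relationType, identifier) pairs whose type is outside `allowed`."""
    root = ET.fromstring(xml_text)
    found = []
    for rel in root.findall("d:relatedIdentifiers/d:relatedIdentifier", NS):
        rel_type = rel.get("relationType")
        if rel_type not in allowed:
            found.append((rel_type, (rel.text or "").strip()))
    return found


# Hypothetical record: a single reference that is part of an overall project,
# plus one relation whose type is not in the placeholder subset.
SAMPLE_RELATIONS = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.1234/project</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCompiledBy">10.1234/software</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""

print(disallowed_relations(SAMPLE_RELATIONS))
```

Both relation types in the sample are valid under the DataCite standard; only the second is flagged, which illustrates that the guide narrows the standard rather than departing from it.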

The following conventions resolve ambiguities in the current standard:

  • Date: Specification of dates and times that do not concern the version history of the data itself, but phenomena that are described by the data (coverage).
  • Subject: Identification of locations using authority data from an external service (GeoNames).
  • Description: In addition to the content abstract, DataCite also offers the possibility to provide information on the technical and methodological implementation. A controlled list of terms for this purpose is attached to the Best Practice Guide as a suggestion.
  • Free text fields of all kinds: Each free text field must be available at least in English; other languages are also possible (proper names do not have to be translated). The language used is specified in accordance with ISO Standard 639-1. [42: The use of ISO Standard 639-3 instead of 639-1 would be preferable, as the former covers a wider range of languages. Until this proposal is evaluated by the DataCite consortium, variant 639-1 will continue to be used for reasons of interoperability.]
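The language convention can be checked in the same way. The following sketch rests on several assumptions: it inspects only titles and descriptions (the guide speaks of free text fields of all kinds), it tests only the two-letter shape of an ISO 639-1 code rather than looking the code up in the registry, and the function name `check_languages` and the sample record are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

NS = {"d": "http://datacite.org/schema/kernel-4"}
# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Free-text fields inspected in this sketch (an assumption, see lead-in).
FREE_TEXT_PATHS = {
    "title": "d:titles/d:title",
    "description": "d:descriptions/d:description",
}
# Shape of an ISO 639-1 code only; this is not a registry lookup.
TWO_LETTER = re.compile(r"^[a-z]{2}$")


def check_languages(xml_text):
    """Return a list of language-related problems for the free-text fields."""
    root = ET.fromstring(xml_text)
    problems = []
    for name, path in FREE_TEXT_PATHS.items():
        langs = [el.get(XML_LANG) for el in root.findall(path, NS)]
        if not langs:
            continue  # a missing field is a separate check
        for lang in langs:
            if lang is None:
                problems.append(f"{name}: xml:lang missing")
            elif not TWO_LETTER.match(lang):
                problems.append(f"{name}: '{lang}' is not a two-letter code")
        if "en" not in langs:
            problems.append(f"{name}: no English version")
    return problems


# Hypothetical record: German-only title, description tagged with a
# three-letter code instead of the two-letter ISO 639-1 code.
SAMPLE_LANG = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <titles><title xml:lang="de">Beispieldatensatz</title></titles>
  <descriptions>
    <description xml:lang="eng" descriptionType="Abstract">A short abstract.</description>
  </descriptions>
</resource>"""

for problem in check_languages(SAMPLE_LANG):
    print(problem)
```

The sample yields three findings: the title has no English version, and the description carries a three-letter code, so it is reported both for the code format and for the missing English version. Should ISO 639-3 be adopted later, only the pattern and the expected English code would need to change.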

These and other suggestions have been submitted as proposals to the DataCite consortium and are available for consideration in future versions of the metadata schema. The option to uniquely identify institutions using an ROR identifier and the option to specify information on the research funding received have already been implemented.

9. Summary and outlook

The generic metadata schema DataCite is used worldwide in research data infrastructures and repositories and can rely on a large community to contribute to its further development. The starting point for the development of the Best Practice Guide and thus the motivation for this contribution was the realization that the use of a uniform metadata schema alone is not sufficient to enable interoperability.

The publication of the guide was preceded by a comprehensive development process in which actors from the areas of data generation, preparation and curation, up to (meta-)data aggregation were involved in order to take all relevant perspectives into account. The official DataCite documentation, DataCite experience reports and the extremely heterogeneous DataCite XML files of various research projects located at the ITG and the LRZ served as the source basis. Building on this, a guideline was developed to support researchers in describing their research data and to promote interoperability between research data infrastructures through stronger standardization of inputs. This will contribute to improving the Findability, Interoperability, and Reusability of research data in accordance with the FAIR principles.

In addition to the Best Practice Guide, a DataCite generator for the structured description of the data stock of research projects was adapted and better aligned to the needs of the researchers. Finally, suggestions for improvement were submitted to the DataCite consortium for consideration in future versions of the metadata schema.

Last but not least, the guide contributes to the development of DataCite’s potential as a new source for bibliometric analysis [43: Robinson-Garcia u.a. 2017.] in EU projects such as Make Data Count [44: Lowenberg 2017-2019.] and FREYA [45: Lambert/Fenner 2017-2020.]. The latter also investigates the distribution and networking of PIDs. [46: Ferguson u.a. 2018.] DataCite plays a central role in the construction of a PID graph. [47: Fenner/Aryani 2019.] This should help to connect the established PID systems and ensure a better supply of information.

It is planned to adapt the Best Practice Guide to new versions and developments at regular intervals. The guide and all related files are located in a Git repository and are open for the community’s contribution and feedback. The authors of this article hope that the Best Practice Guide will be widely accepted by researchers and infrastructure partners alike and will be used beyond Munich.

 

 

Note: There are currently no standard guidelines for the citation of (online) projects. Furthermore, a considerable number of projects do not provide a citation suggestion. In this article, project information that could be obtained from the project website or relevant portals (e.g. DFG GEPRIS) was used, unless a citation suggestion was available.

Bibliography

  • Álvarez u.a. 2013 = Álvarez, Bárbara / Campbell, Emily / Colman, Jason / Grochowski, Paul F. / Knott, Martin / MacEachern, Mark P. / Martin, Scott T. / Oehrli, Angela / Price, Rebecca H. / Sears, JoAnn / Sferdean, Fe C. / Turkel, Susan Beckwitt (2013): DataCite Implementation Recommendations: A Report of the DataCite Task Force, Ann Arbor.
  • Bode u.a. 2017- = Bode, Arndt / Grimm, Christian / Hasselbring, Wilhelm / Nagel, Wolfgang / Tochtermann, Klaus (eds.) (2017-): GeRDI: Generic Research Data Infrastructure, Hamburg/Kiel.
  • Brase u.a. 2015 = Brase, Jan / Sens, Irina / Lautenschlager, Michael (2015): The tenth anniversary of assigning DOI names to scientific data and a five year history of DataCite, in: D-Lib Magazine, vol. 21, 1/2, Corporation for National Research Initiatives.
  • Braun 2011 = Braun, Jürgen (2011): Report: Analyse der Metadatenqualität und Interoperabilität, Kompetenzzentrum Interoperable Metadaten (KIM).
  • CrossRef 2019 = CrossRef (2019): Metadata deposit schema 4.4.2, CrossRef.
  • Dasler 2019 = Dasler, Robin (2019): Affiliation Facet - New in DataCite Search, DataCite Blog, DataCite.
  • DataCite Metadata Working Group 2019 = DataCite Metadata Working Group (2019): DataCite Metadata Schema Documentation for the Publication and Citation of Research Data v4.3, DataCite.
  • DHd AG Datenzentren 2018 = DHd AG Datenzentren (2018): Geisteswissenschaftliche Datenzentren im deutschsprachigen Raum - Grundsatzpapier zur Sicherung der langfristigen Verfügbarkeit von Forschungsdaten, Zenodo.
  • Dreyer u.a. 2019 = Dreyer, Britta / Hagemann-Wilholt, Stephanie / Vierkant, Paul / Strecker, Dorothea / Glagla-Dietz, Stephanie / Summann, Friedrich / Pampel, Heinz / Burger, Marleen (2019): Die Rolle der ORCID iD in der Wissenschaftskommunikation: Der Beitrag des ORCID-Deutschland-Konsortiums und das ORCID-DE-Projekt, in: ABI Technik, vol. 39, 2, De Gruyter, 112-121.
  • Fenner/Aryani 2019 = Fenner, Martin / Aryani, Amir (2019): Introducing the PID Graph, DataCite.
  • Ferguson u.a. 2018 = Ferguson, Christine / McEntyre, J / Bunakov, V / Lambert, S / van der Sandt, S / Kotarski, R (2018): D3.1 Survey of Current PID Services Landscape, FREYA Project.
  • Focht 2004- = Focht, Josef (ed.) (2004-): Bayerisches Musiker-Lexikon Online, München.
  • Franzke 2017 = Franzke, Cordula (2017): Repositorien für Forschungsdaten am Beispiel der Digital Humanities im nationalen und internationalen Vergleich-Potentiale und Grenzen, in: Perspektive Bibliothek 6.1 (2017), 2-33.
  • Götz u.a. 2019 = Götz, Alexander / Weber, Tobias / Hachinger, Stephan (2019): Let The Data Sing - A Scalable Architecture to Make Data Silos FAIR, Zenodo.
  • Gradl u.a. 2015 = Gradl, Tobias / Henrich, Andreas / Plutte, Christoph (2015): Heterogene Daten in den Digital Humanities: Eine Architektur zur forschungsorientierten Föderation von Kollektionen, in: Constanze Baum / Thomas Stäcker (eds.), Grenzen und Möglichkeiten der Digital Humanities (= Sonderband der Zeitschrift für digitale Geisteswissenschaften, 1), Wolfenbüttel 2015.
  • Grunzke 2016 = Grunzke, Richard (2016): Generic Metadata handling in Scientific Data Life Cycles - Kurzfassung, TU Dresden, Dissertation.
  • Grunzke u.a. 2017 = Grunzke, Richard / Adolph, Tobias / Biardzki, Christoph / Bode, Arndt / Borst, Timo / Bungartz, Hans-Joachim / Busch, Anja / Frank, Anton / Grimm, Christian / Hasselbring, Wilhelm et al. (2017): Challenges in creating a sustainable generic research data infrastructure, in: Softwaretechnik-Trends, vol. 37, 2, 74-77.
  • Haak u.a. 2012 = Haak, Laurel L / Fenner, Martin / Paglione, Laura / Pentz, Ed / Ratner, Howard (2012): ORCID: a system to uniquely identify researchers, in: Learned Publishing, vol. 25, 4, Wiley Online Library, 259-264.
  • Helbig u.a. 2015 = Helbig, Kerstin / Hausstein, Brigitte / Toepfer, Ralf (2015): Supporting Data Citation: Experiences and Best Practices of a DOI Allocation Agency for Social Sciences, in: Journal of Librarianship & Scholarly Communication, vol. 3, 2.
  • Hengerer/Schön 2014- = Hengerer, Mark / Schön, Gerhard (eds.) (2014-): Kaiser und Höfe. Personendatenbank der Höflinge der Österreichischen Habsburger des 16. und 17. Jahrhunderts, München.
  • Hirschmann 2015 = Hirschmann, Barbara (2015): Entwicklung von Standards und Best Practices im Bereich der Forschungsdatenpublikation: ein Blick auf die Arbeit von DataCite, ETH Zurich (presentation at the Open-Access-Tage, 8.9.2015).
  • International DOI Foundation 2017 = International DOI Foundation (2017): DOI Handbook.
  • Jannidis u.a. 2017c = Jannidis, Fotis / Kohle, Hubertus / Rehbein, Malte (eds.) (2017): Digital Humanities: Eine Einführung, Stuttgart, J.B. Metzler.
  • Kim u.a. 2017 = Kim, Jihyun / Chung, EunKyung / Yoon, JungWon / Lee, Jae Yun (2017): The current state and recommendations for data citation, in: Journal of the Korean Society for Information Management, vol. 34, 1, Korean Society for Information Management, 7-29.
  • Krefeld 2018ae = Krefeld, Thomas (2018): Konzept, in: Methodologie, VerbaAlpina-de 19/1.
  • Krefeld/Lücke 2014- = Krefeld, Thomas / Lücke, Stephan (eds.) (2014-): VerbaAlpina. Der alpine Kulturraum im Spiegel seiner Mehrsprachigkeit, München, online, LMU.
  • Krefeld/Lücke 2018c = Krefeld, Thomas / Lücke, Stephan (2018): Typisierung, in: Methodologie, VerbaAlpina-de 19/1.
  • Lambert/Fenner 2017-2020 = Lambert, Simon / Fenner, Martin (eds.) (2017-): FREYA.
  • Leduc u.a. 2019 = Leduc, Martin / Mailhot, Alain / Frigon, Anne / Martel, Jean-Luc / Ludwig, Ralf / Brietzke, Gilbert B / Giguère, Michel / Brissette, François / Turcotte, Richard / Braun, Marco et al. (2019): The ClimEx Project: a 50-member ensemble of climate change projections at 12-km resolution over Europe and Northeastern North America with the Canadian regional climate model (CRCM5), in: Journal of Applied Meteorology and Climatology, vol. 58, 4, 663-693.
  • Lowenberg 2017-2019 = Lowenberg, Daniella (ed.) (2017-2019): Make Data Count.
  • Ludwig 2015-2019 = Ludwig, Ralf (ed.) (2015-2019): KlimEx (engl. ClimEx). Klimawandel und Hydrologische Extremereignisse – Risiken und Perspektiven für die Wasserwirtschaft in Bayern, München.
  • Müller 2019 = Müller, Lars (2019): Kooperatives Management geisteswissenschaftlicher Forschungsdaten, in: ABI Technik, vol. 39, 3, De Gruyter, 194-201.
  • Pempe 2012 = Pempe, Wolfgang (2012): Geisteswissenschaften, in: Heike Neuroth et al. (eds.): Langzeitarchivierung von Forschungsdaten. Eine Bestandsaufnahme, Boizenburg 2012, 138-160.
  • Pletsch u.a. 2018 = Pletsch, Katja / Hausstein, Brigitte / Dreyer, Britta (2018): DataCite Services als Baustein des Forschungsdatenmanagements in wissenschaftlichen Bibliotheken (presentation as a hands-on lab at the 107. Deutscher Bibliothekartag, 14.6.2018).
  • Pohl/Danowski = Pohl, Adrian / Danowski, Patrick (2014): 5.4 Linked Open Data in der Bibliothekswelt - Überblick und Herausforderungen, in: Griebel, Rolf et al. (eds.): Praxishandbuch Bibliotheksmanagement, Berlin 2014, 392–409.
  • Richter/Rechenmacher/Riepl 1986- = Richter, Wolfgang / Rechenmacher, Hans / Riepl, Christian (eds.) (1986-): Biblia Hebraica transcripta (Forschungsdatenbank 3.0), München.
  • Robinson-Garcia u.a. 2017 = Robinson-Garcia, Nicolas / Mongeon, Philippe / Jeng, Wei / Costas, Rodrigo (2017): DataCite as a novel bibliometric source: Coverage, strengths and limitations, in: Journal of Informetrics, vol. 11, 3, Elsevier, 841-854.
  • Rueda 2016 = Rueda, Laura (2016): DataCite Metadata Schema 4.0 Webinar, DataCite.
  • Rueda u.a. 2016 = Rueda, Laura / Fenner, Martin / Cruse, Patricia (2016): DataCite: Lessons Learned on Persistent Identifiers for Research Data, in: International Journal of Digital Curation, vol. 11, 2, 39-47.
  • Rühle 2012b = Rühle, Stefanie (2012): Kleines Handbuch Metadaten – Metadaten, Kompetenzzentrum Interoperable Metadaten (KIM).
  • Söllner/Riepl/Weiß 2018- = Söllner, Konstanze / Riepl, Christian / Weiß, Alexander (eds.) (2018-): eHumanities - interdisziplinär, Erlangen/München.
  • Starr/Gastl 2011 = Starr, Joan / Gastl, Angela (2011): isCitedBy: A Metadata Scheme for DataCite, in: D-Lib Magazine, vol. 17, 1/2, CNRI Acct.
  • Takeda 2015 = Takeda, Hideaki (2015): Research Data-DOI Experiment in Japanese DOI Registration Agency, Japan Link Center, JaLC (presentation at the CODATA-ICSTI Data Citation Workshop, 29.10.2015).
  • Wagner Webster 2019 = Wagner Webster, Jessica (2019): Digital Collaborations: A Survey Analysis of Digital Humanities Partnerships Between Librarians and Other Academics, in: Digital Humanities Quarterly (DHQ), vol. 13, 4, Boston, MA, Alliance of Digital Humanities Organizations.
  • Weber/Kranzlmüller 2018a = Weber, T. / Kranzlmüller, D. (2018): How FAIR Can you Get? Image Retrieval as a Use Case to Calculate FAIR Metrics, in: 2018 IEEE 14th International Conference on e-Science (e-Science), 114-124.
  • Wilkinson u.a. 2016 = Wilkinson, Mark D. / Dumontier, Michel / Aalbersberg, IJsbrand Jan / Appleton, Gabrielle / Axton, Myles / Baak, Arie / Blomberg, Niklas / Boiten, Jan-Willem / da Silva Santos, Luiz Bonino / Bourne, Philip E et al. (2016): The FAIR Guiding Principles for scientific data management and stewardship, in: Scientific Data, vol. 3, Nature Publishing Group.
  • Zhang u.a. 2015 = Zhang, Yue / Ogletree, Adrian / Greenberg, Jane / Rowell, Chelcie (2015): Controlled vocabularies for scientific data: users and desired functionalities, in: Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, 54.