FAIR data station for lightweight metadata management and validation of omics studies

Bart Nijsse
Peter J Schaap
Jasper J Koehorst

Abstract

Background

The life sciences are one of the biggest suppliers of scientific data. Reusing and connecting these data can uncover hidden insights and lead to new concepts. Efficient reuse of these datasets is strongly promoted when they are interlinked with a sufficient amount of machine-actionable metadata. While the FAIR (Findable, Accessible, Interoperable, Reusable) guiding principles have been accepted by all stakeholders, in practice, there are only a limited number of easy-to-adopt implementations available that fulfill the needs of data producers.

Findings

We developed the FAIR Data Station, a lightweight application written in Java that aims to support researchers in managing research metadata according to the FAIR principles. It implements the ISA metadata framework and uses minimal information metadata standards to capture experiment metadata. The FAIR Data Station consists of 3 modules. Based on the minimal information model(s) selected by the user, the “form generation module” creates a metadata template Excel workbook with a header row of machine-actionable attribute names. The Excel workbook is subsequently used by the data producer(s) as a familiar environment for sample metadata registration. At any point during this process, the format of the recorded values can be checked using the “validation module.” Finally, the “resource module” can be used to convert the set of metadata recorded in the Excel workbook into RDF format, enabling (cross-project) (meta)data searches, and, for publishing of sequence data, into a European Nucleotide Archive–compatible XML metadata file.
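
As a rough illustration of the last step described above, the following Python sketch turns one recorded metadata row into N-Triples statements. It is not the FAIR Data Station's Java implementation: the function names, the column name `sample_id`, and the `example.org` IRIs are hypothetical and chosen only for this example.

```python
from urllib.parse import quote


def escape_literal(text):
    """Escape the characters that N-Triples does not allow raw inside a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(row, id_column, base_iri, vocab_iri):
    """Convert one metadata row (attribute name -> value) into N-Triples lines.

    All parameter names and IRIs are assumptions made for this sketch.
    """
    # The identifier column becomes the subject IRI of every statement.
    subject = f"<{base_iri}{quote(str(row[id_column]), safe='')}>"
    triples = []
    for attribute, value in row.items():
        if attribute == id_column or value in ("", None):
            continue  # empty cells produce no statement
        # Percent-encode the attribute name so that it forms a valid IRI.
        predicate = f"<{vocab_iri}{quote(attribute, safe='')}>"
        triples.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return triples


if __name__ == "__main__":
    example_row = {"sample_id": "S1", "organism": "Escherichia coli", "depth": ""}
    for line in row_to_ntriples(
        example_row, "sample_id", "http://example.org/sample/", "http://example.org/vocab/"
    ):
        print(line)
```

Emitting plain string literals keeps the sketch short; a fuller conversion would map attributes to ontology terms and typed values.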

Conclusions

Turning FAIR into reality requires the availability of easy-to-adopt data FAIRification workflows that are also of direct use for data producers. As such, the FAIR Data Station provides, in addition to the means to correctly FAIRify (omics) data, the means to build searchable metadata databases of similar projects and can assist in ENA metadata submission of sequence data. The FAIR Data Station is available at https://fairbydesign.nl.

GigaScience
Apr 10, 2023

Background

Reviewer2-Sveinung Gundersen

The paper describes the FAIR Data Station, a lightweight application written in Java that facilitates FAIR-by-design by allowing the collection of structured metadata from the first phase of a project. To this end, the authors have applied and extended the ISA metadata framework to form a core data structure wherein attributes from a library of 40 frequently used minimal information checklists can be placed. The FAIR Data Station contains tools for generating and validating Excel metadata files, and for conversion to RDF format as well as to a European Nucleotide Archive (ENA) compatible XML metadata file for submission.

General comments:

The FAIR Data Station (FAIR-DS) seems to be a useful application to help life science researchers collect and structure metadata according to the FAIR principles. The software is based on core community standards, ontologies and checklists. As for deposition databases, the software currently seems to integrate only with ENA, which, on the other hand, is a central deposition database.

The three main contributions of FAIR-DS are, to my mind, A) the metadata schema that has been carefully constructed by the authors, B) the validation functionality of metadata against said schema, and C) functionality for conversion of validated metadata into RDF and deposition formats. There are, however, some architectural choices and technical limitations in the implementation that I have issues with and which make me uncertain whether the software shows enough "innovation in the approach, implementation, or have added benefits", as mentioned in the "Instructions for Authors" (https://academic.oup.com/gigascience/pages/technical_note). I would therefore invite the authors to address the following issues:

1. The authors state that "the FAIR-DS uses an extended version of the original three-tier Investigation, Study, Assay (ISA) metadata framework [https://isa-tools.org]". This leads the reader to think that the software applies the full ISA Abstract Model (https://isa-specs.readthedocs.io/en/latest/isamodel.html), which is not correct. Only the top-level objects and a few attributes are retained. It is also not clear why the authors have found it necessary to add additional, custom object types, such as "Observation unit", explained as "the "object" from which the measurements are taken". The ISA model includes an attribute "source material" which seems to overlap. The authors have also added "sample" as a top-level object, even though there is already a "sample" attribute in the ISA model. It is unclear to me what is improved by adding new object types and whether any such improvements will outweigh the obvious drawbacks that come with not following a community standard for the metadata schema.

2. The FAIR-DS makes use of Excel files as an intermediate format for collection of user metadata. While the feature set of Excel and its familiarity for most users are good arguments for its adoption, I miss a discussion of the fact that a commercial product is included in the core architecture of the system. FAIR principle I1 promotes that: "(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation". As Excel is only an intermediate metadata format, while RDF is used for the final output, the FAIR-DS does not directly break principle I1; however, I think the choice of a commercial file format does not follow the "spirit" of FAIR. I see no reason why CSV could not be included as an alternative to Excel, or why the authors could not recommend an Open Source application as an alternative for users who wish their entire software suite to remain in the Open Source domain.

3. The metadata schema is not represented in a standard schema format, such as JSON Schema, Frictionless table schema, or similar. Using a shared format for representing the metadata schema makes it possible to use general validation libraries (such as the ELIXIR Biovalidator: https://doi.org/10.1093/bioinformatics/btac195). Shared schema formats also allow for reuse of the schema in other contexts/software. In FAIR-DS, the metadata schema seems to be primarily represented in an implicit way in the Java source code that generates the Excel files as a secondary representation of the schema. Even though the FAIR principles might not directly include a recommendation to share the metadata schema in a FAIR way, one can argue that this falls under R1.3: "(Meta)data meet domain-relevant community standards". It would in any case be in "the spirit of FAIR".

4. As a consequence of issue 3, the validation functionality is also specified implicitly in the Java source code and does not seem to reuse much external validation functionality. I particularly miss validation of ontology terms against the relevant ontologies, as well as more stringent validation of PMIDs, DOIs etc., preferably using CURIEs instead of URLs. All of these data types only seem to be validated as general strings, which is of limited use. Users might for instance introduce spelling variants for ontology term labels without this being detected by the validator.

5. Due to the hard-coded nature of the metadata schema, the validator and the conversion functionality, I suspect the authors might not have designed the system flexibly enough to allow for easy updates based on updates in the external dependencies, i.e. the minimal information checklists, ontologies, or deposition schemas. For instance, EMBL-EBI, who are hosting ENA, are moving towards requiring the submission of sample data/metadata to BioSamples prior to submitting the metadata to ENA, which might have consequences for the checklist requirements. Also, ontologies in particular are known to be updated regularly.

6. I am not convinced that the authors have done a careful enough search of the literature to list relevant software solutions for comparison. For instance, the FAIRDOM Seek solution (https://doi.org/10.1186/s12918-015-0174-y) is not cited directly, although the functionality seems to be highly overlapping.

7. The manuscript would benefit from careful proofreading of the language and grammar.

When addressing these issues, I would urge the authors to better demonstrate "innovation in the approach, implementation, or ... added benefits",
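
Issues 3 and 4 above can be made concrete with a small sketch: a schema expressed as data rather than code, and CURIE-style patterns for identifiers. Everything here is an assumption for illustration, not the FAIR-DS schema or validator: the attribute names, the `SCHEMA` layout, and the patterns (a `PMID:` prefix followed by digits, a `doi:` prefix followed by a `10.` registrant prefix and a suffix, and a generic `PREFIX:digits` form for ontology terms) are hypothetical. The patterns check syntax only; confirming that an ontology term actually exists would still require a lookup against the ontology itself.

```python
import re

# Hypothetical declarative schema: attribute name -> rules.
# Because the rules are plain data, they could equally be loaded from a JSON file.
SCHEMA = {
    "sample_id": {"required": True, "pattern": r"[A-Za-z0-9_.-]+"},
    "pubmed_id": {"required": False, "pattern": r"PMID:[1-9][0-9]*"},
    "doi": {"required": False, "pattern": r"doi:10\.[0-9]{4,9}/\S+"},
    "env_medium": {"required": False, "pattern": r"[A-Za-z][A-Za-z0-9_]*:[0-9]+"},
}


def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable problems found in one metadata row."""
    errors = []
    for name, rule in schema.items():
        value = str(row.get(name) or "").strip()
        if not value:
            if rule["required"]:
                errors.append(f"{name}: missing required value")
            continue
        # fullmatch rejects values that merely contain a valid identifier.
        if re.fullmatch(rule["pattern"], value) is None:
            errors.append(f"{name}: {value!r} does not match {rule['pattern']}")
    # Columns that the schema does not know about are reported as well.
    for name in row:
        if name not in schema:
            errors.append(f"{name}: unknown attribute")
    return errors


if __name__ == "__main__":
    row = {
        "sample_id": "S1",
        "pubmed_id": "https://pubmed.ncbi.nlm.nih.gov/12345",
        "env_medium": "sea water",
    }
    for problem in validate_row(row):
        print(problem)
```

In this example the URL form of the PubMed identifier and the free-text label "sea water" are both reported, which is the kind of spelling-variant problem that plain string validation would let through.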

GigaScience
Apr 10, 2023

Abstract

Reviewer1-Dominique Batista: Overall, a strong paper that creates a new bridge between the ISA model and the FAIR principles.

A few points should be addressed:

- page 2:
* "As one Investigation can have several research lines, each Study layer has a unique identifier ...": how do you generate these identifiers and control their uniqueness, persistency and stability? Are these identifiers resolvable?
* "As an extension to the original three-tier ISA-model in between Study and Assay two additional layers of information were added Observation unit and Sample": would you clarify what problems were addressed? More generally speaking, does the FAIR-DS integrate with existing implementations of the ISA model? Did you consider a conversion and submission to external systems such as the ones mentioned in the conclusion?
* The text for figure 1 is good, but the corresponding text in the core of the document is hard to read and understand.
* "Model specific attributes are optionally selected by the user": Does this mean users can add extra fields on top of the provided packages or that they have to select fields within the given package?
- page 3:
* "In addition, we included regular expressions obtained from the ENA checklist, such as "(0|((0\.)|([1-9][0-9]*\.?))[0-9]*)([Ee][+-]?[0-9]+)? (g|mL|mg|ng)" for sample volume or weight for DNA extraction": good point. Is there a mechanism for users to add new regex?
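
To make the page 3 point concrete, here is a minimal Python sketch of how such a checklist pattern behaves. The pattern is an assumed reading of the expression quoted in the review: the escaped dots and `*` quantifiers are a reconstruction, since the quoted text is damaged in this copy. The helper names `is_valid_amount` and `compile_user_pattern` are hypothetical; the second only illustrates one way a user-supplied regex could be checked before being accepted, and does not describe an existing FAIR-DS feature.

```python
import re

# Assumed reading of the quoted checklist pattern (number, space, unit).
# The escaped dots and "*" quantifiers are a reconstruction, not a verified copy.
WEIGHT_OR_VOLUME = re.compile(
    r"(0|((0\.)|([1-9][0-9]*\.?))[0-9]*)([Ee][+-]?[0-9]+)? (g|mL|mg|ng)"
)


def is_valid_amount(value):
    """True if the whole value is a number, one space, and an allowed unit."""
    return WEIGHT_OR_VOLUME.fullmatch(value) is not None


def compile_user_pattern(pattern):
    """Compile a user-supplied regex, turning syntax errors into a clear message."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"invalid regular expression {pattern!r}: {exc}") from exc


if __name__ == "__main__":
    for candidate in ["0.5 mL", "12 ng", "1e-3 g", "5mg", "5 kg"]:
        print(candidate, is_valid_amount(candidate))
```

Under this reading, values such as "5mg" (no space) or "5 kg" (unit not listed) are rejected, so the pattern constrains both the number format and the unit.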

Version published to 10.1093/gigascience/giad014
Dec 28, 2022
Version published to 10.1101/2022.08.03.502622 on bioRxiv
Aug 5, 2022