ISA API: An open platform for interoperable life science experimental metadata

Abstract

Background

The Investigation/Study/Assay (ISA) Metadata Framework is an established and widely used set of open source community specifications and software tools for enabling discovery, exchange, and publication of metadata from experiments in the life sciences. The original ISA software suite provided a set of user-facing Java tools for creating and manipulating the information structured in ISA-Tab—a now widely used tabular format. To make the ISA framework more accessible to machines and enable programmatic manipulation of experiment metadata, the JSON serialization ISA-JSON was developed.

Results

In this work, we present the ISA API, a Python library for the creation, editing, parsing, and validating of ISA-Tab and ISA-JSON formats by using a common data model engineered as Python object classes. We describe the ISA API feature set, early adopters, and its growing user community.

Conclusions

The ISA API provides users with rich programmatic metadata-handling functionality to support automation, a common interface, and an interoperable medium between the 2 ISA formats, as well as with other life science data formats required for depositing data in public databases.

Now published in GigaScience doi: 10.1093/gigascience/giab060

David Johnson (1, 2), Alejandra Gonzalez-Beltran (1, 5), Kenneth Haug (3, 6), Massimiliano Izzo (1), Martin Larralde (7), Thomas N. Lawson (8), Alice Minotto (4), Pablo Moreno (3), Venkata Chandrasekhar Nainala (3), Claire O’Donovan (3), Luca Pireddu (9), Pierrick Roger (10), Felix Shaw (4), Christoph Steinbeck (11), Ralf J. M. Weber (8, 12), Susanna-Assunta Sansone (1), Philippe Rocca-Serra (1)

1 Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, OX1 3QG, Oxford, United Kingdom
2 Department of Informatics and Media, Uppsala University, Box 513, 751 20 Uppsala, Sweden
3 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
4 Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, United Kingdom
5 Science and Technology Facilities Council, Scientific Computing Department, Rutherford Appleton Laboratory, Harwell Campus, Didcot, OX11 0QX, United Kingdom
6 Genome Research Limited, Wellcome Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Saffron Walden CB10 1RQ, United Kingdom
7 Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), Meyerhofstraße 1, 69117 Heidelberg, Germany
8 School of Biosciences, University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom
9 Distributed Computing Group, CRS4: Center for Advanced Studies, Research & Development in Sardinia, Pula, Italy
10 CEA, LIST, Laboratory for Data Analysis and Systems’ Intelligence, MetaboHUB, Gif-Sur-Yvette F-91191, France
11 Cheminformatics and Computational Metabolomics, Institute for Analytical Chemistry, Lessingstr. 8, 07743 Jena, Germany
12 Phenome Centre Birmingham, University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom

For correspondence: susanna-assunta.sansone@oerc.ox.ac.uk philippe.rocca-serra@oerc.ox.ac.uk

A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giab060), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

These peer reviews were as follows:

**Reviewer 1: Kevin Menden** In the paper "ISA API: An open platform for interoperable life science experimental metadata", Johnson et al. present an extensive Python API for reading, writing, and handling metadata in the ISA format. The authors describe the increasing use of the ISA formats and thus indicate the need for better tools to handle such data. The article is well written and easy to understand. The ISA tools package contains extensive functionality and solid documentation. Furthermore, it can be installed from PyPI and Bioconda, which I think should be standard nowadays. The authors also provide a Docker image, which is nice. All in all, I think the ISA tools package is a genuinely useful piece of software that is well written, which is why I recommend this manuscript for publication in GigaScience.

However, a few minor things should be changed. Personally, I would like to know whether support for upload to additional databases will be added in the future; this could be noted in the text. The article contains many figures with only little content. I would strongly advise merging some of these figures into a smaller set to improve readability. The authors spend a considerable amount of text on download statistics, which in my opinion are not really that relevant for the software package. I would recommend shortening this section considerably. On a similar note, the methods section basically just describes how these download statistics were handled. Considering that this article describes a software package, it might be more useful to the reader (and reviewer) to elaborate a bit on how the software is written, maintained, structured, and tested, and on related matters.

**Reviewer 2: Manuel Holtgrewe** The authors describe the Python library "isatools" for accessing ISA (Investigation/Study/Assay) files in ISA-Tab and ISA-JSON format. The authors start by describing their previous work around the ISA data model and file formats in detail. They then describe their implementation and the features of their API. They highlight the extensibility and efficiency of their object-oriented model. They describe in detail how metadata can be curated in ontologies and that extensions are currently underway for the assisted creation of study metadata. They then refer to early adopters and a stable and growing community. They conclude with the statement that their library is "a major step forward in making the ISA framework open and interoperable".

General Remarks

Overall, we have found the ISA data model and ISA-Tab data format to be very useful in our own work. However, there are some issues with the software, including apparent bugs, as described below. In 2018, my colleagues and I considered using ISA-API in our project for ISA-Tab parsing, but the problems and the lack of automated tests made us roll our own (also see below). Overall, the authors make a clear point and the paper is well written. However, the software appears to be unfinished and some work is required to make it suitable for publication.

Major Issues

The ISA-creator and Bio-GraphIIn are cited as having "helped grow the ISA community of users". The authors should offer evidence for this, as (a) in our own experience ISA-creator is very hard to use, which is also the opinion expressed by everyone I have met so far who has used it, and (b) it is not possible to validate how Bio-GraphIIn has helped grow the community, as the website linked to in the cited article is no longer available and no source code is available, e.g., on GitHub. The Google Groups forum has fewer than 10 threads per year, with 2 in 2020 so far and 1 in 2019. The authors should balance these counts against their PyPI download statistics.
The authors should cite other published APIs for ISA file formats, e.g., AltamISA.

Kuhring, Mathias, et al. "AltamISA: a Python API for ISA-Tab files." Journal of Open Source Software 4.40 (2019): 1610.

The authors should show proof for the "efficiency" of their object-oriented model, e.g., by comparing import efficiency with that of AltamISA. I'm raising this point as some users raised questions on efficiency when loading/writing data files in the ISA-API GitHub Issues.
The authors write that development is in progress, but it appears from the GitHub code frequency graph that development has mostly stalled since 2018.
The authors should explain in more detail how stable their API is and what the limitations and assumptions are. In my opinion, one important point in data import and export is how the data looks after a "round-trip", e.g., import ISA-Tab, followed by export ISA-Tab. I have done this on the official ISA data sets (https://github.com/ISA-tools/ISAdatasets, commit f20be4f83dc5f6f7ec419bfd634efba3177e4ae4). Here are the (to me unexpected) results for the official example data: (a) on BII-I-1, whole columns disappear, such as the first "Material Type" column; (b) all other datasets fail to parse, and parsing crashes with Python exceptions. I think the authors should work on these points. It cannot be judged whether the software can be published at this point. The software appears unfinished and some more work has to go into it to allow for publication.
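
The round-trip check described above can be sketched with the Python standard library alone. The snippet does not use the ISA API or AltamISA: the sample table and the helper names `parse_rows`, `write_rows`, `naive_round_trip`, and `lost_columns` are hypothetical and exist only for illustration. It compares the header before and after export, and shows one way a repeated column such as "Material Type" could vanish, namely when rows are stored as dictionaries keyed by column name. This is an assumption about a possible failure mode, not a diagnosis of the behavior reported in the review.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited table in which the "Material Type" header repeats.
SAMPLE_TSV = (
    "Source Name\tMaterial Type\tSample Name\tMaterial Type\n"
    "source1\twhole organism\tsample1\tliver\n"
    "source2\twhole organism\tsample2\tkidney\n"
)


def parse_rows(text):
    """Parse tab-delimited text into (header, rows), preserving duplicate columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    return header, list(reader)


def write_rows(header, rows):
    """Serialize a header and rows back to tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def naive_round_trip(text):
    """Round-trip via dicts keyed by column name; repeated headers collapse into one."""
    records = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    header = list(records[0].keys())
    rows = [[record[name] for name in header] for record in records]
    return write_rows(header, rows)


def lost_columns(original_header, exported_header):
    """Return the column names (with multiplicity) missing after export."""
    missing = Counter(original_header) - Counter(exported_header)
    return sorted(missing.elements())


original_header, original_rows = parse_rows(SAMPLE_TSV)

# A list-based round trip reproduces the input exactly.
faithful = write_rows(original_header, original_rows)

# A dict-based round trip silently drops one of the repeated columns.
naive = naive_round_trip(SAMPLE_TSV)
naive_header, _ = parse_rows(naive)
```

Running the same comparison over each file of a dataset, and reporting `lost_columns` alongside any parsing exceptions, would give the kind of reproducible evidence a round-trip claim needs.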

Minor Issues

The authors should provide more automated tests for their software. In 2018 when we tried out the package we found some inconsistencies and problems but found it hard to fix bugs in the large body of software because of the lack of comprehensive automated tests.

**Reviewer 3: Chris Hunter** The manuscript is well written and coherent; it provides a nice balance of historical context of ISA-Tools and the current release of the ISA-API. As a biologist and biocurator I can attest to the importance of simple-to-use tools for curation of datasets, and the ISA-creator has been well used by the community over the years. The addition of the ISA-API should allow more repositories to incorporate the ISA formats as both import and export formats for datasets. I have to admit that my lack of experience as a developer means that I am in no position to actually test the API's functionality, so I cannot comment on the technical suitability of the implementation or even whether it works or not! I have been requested to review the manuscript with specific reference to the original reviewer 2 concerns:

Reviewer 2 comment 1. "The ISA-creator and Bio-GraphIIn are cited as "helped grow the ISA community of users". The authors should offer evidence for this as; (a) by our own experience ISA-creator is very hard to use and this is also reflected by the expressed opinion on ISA-creator by anyone I have met so far who has used it and (b) it is not possible to validate how Bio-GraphIIn has helped grow the community as the website linked to in the cited article is not available anymore and no source code is available, e.g., on GitHub. The Google groups forum has less than 10 threads per year, with 2 in 2020 so far and one in 2019. The authors should balance these counts with their "PyPi" download counts statistics."

My comment: I believe the authors have addressed the primary concern about evidence of continued growth in the ISA user community with the detailed description of the PyPI download statistics. The reviewer's point about the ISA-creator user experience, supported only by anecdotal comments from those who have used it, is unfounded and, in fact, if true, adds to the argument for the implementation of the ISA API as a means to allow a wider developer base to improve the ISA-creation experience.

Reviewer 2 comment 2. "The authors should cite other published APIs for ISA file formats, e.g., AltamISA.

Kuhring, Mathias, et al. "AltamISA: a Python API for ISA-Tab files." Journal of Open Source Software 4.40 (2019): 1610."

My comment: The authors have made appropriate changes and included the suggested reference.

Reviewer 2 comment 3. "The authors should show proof for "efficiency" of their object-oriented model, e.g., by comparing import efficiency with that of AltamISA. I'm raising this point as some users raised questions on efficiency when loading/writing data files in the ISA-API GitHub Issues."

My comment: The authors have replaced the word "efficiency" with "coherent" in the manuscript to clarify the meaning in the relevant paragraph. However, I'm not sure they have addressed the principle of the concern raised by reviewer 2, i.e., how does the ISA-API compare to other existing models in terms of efficiency? As I have no idea how to measure the "efficiency" of a model, I'm not convinced this is a valid request from reviewer 2.

Reviewer 2 comment 4. "The authors write that development is in progress but it appears from the GitHub code frequency graph that development has mostly stalled since 2018."

My comment: I agree with the authors' rebuttal of this point; simply looking at GitHub commits is not a suitable measure.

Reviewer 2 comment 5. "The authors should explain in more detail how stable their API is and what the limitations and assumptions are. In my opinion, one important point in data import and export is looking how data looks after a "round-trip", e.g., import ISA-Tab, followed by export ISA-Tab. I have done this on the official ISA data sets (https://github.com/ISA-tools/ISAdatasets, commit f20be4f83dc5f6f7ec419bfd634efba3177e4ae4). Here are the (to me unexpected) results for official example data: (a) On BII-I-1, whole columns disappear such as the first "Material Type" column, (b) All other datasets fail to parse and parsing crashes with Python exceptions."

My comment: Unfortunately, my lack of the required skill set to run any tests myself means I am not in a position to adjudicate on this point! I do agree with the authors' rebuttal that they cannot assess the reviewer's issues based on the minimal information provided in the review. As the authors point out, documentation can always be improved, and one such improvement might be to include a "round-trip" example, as reviewer 2 attempted, showing that one can take a valid ISA-formatted input, convert it to, say, SRA format and back to ISA format using the API, and that the input and output ISA files do indeed match.

Reviewer 2 minor issue 1. "The authors should provide more automated tests for their software. In 2018 when we tried out the package we found some inconsistencies and problems but found it hard to fix bugs in the large body of software because of the lack of comprehensive automated tests."

My comment: I think this comment is unrelated to the review: the reviewer is talking about a version of the tool that is approximately 3 years old, not the current version that they are meant to be reviewing. Despite the irrelevance, the authors have responded by adding text to highlight the test-driven development approach taken in the project.

With the one caveat already mentioned, i.e., that I am unable to test the code and am therefore reliant on the other reviewer to have covered that aspect of the review, I believe the manuscript is suitable for publication, as the authors have adequately addressed all of the reviewer 2 comments with the possible exception of improved documentation.
