SODAR: managing multi-omics study data and metadata

Mikko Nieminen
Oliver Stolpe
Mathias Kuhring
January Weiner
Patrick Pett
Dieter Beule
Manuel Holtgrewe

This article has been Reviewed by the following groups

Listed in

Evaluated articles (GigaScience)

Abstract

Scientists employing omics in life science studies face challenges such as the modeling of multi assay studies, recording of all relevant parameters, and managing many samples with their metadata. They must manage many large files that are the results of the assays or subsequent computation. Users with diverse backgrounds, ranging from computational scientists to wet-lab scientists, have dissimilar needs when it comes to data access, with programmatic interfaces being favored by the former and graphical ones by the latter.

We introduce SODAR, the system for omics data access and retrieval. SODAR is a software package that addresses these challenges by providing a web-based graphical user interface for managing multi assay studies and describing them using the ISA (Investigation, Study, Assay) data model and the ISA-Tab file format. Data storage is handled using the iRODS data management system, which handles large quantities of files and substantial amounts of data. SODAR also offers programmable APIs and command line access for metadata and file storage.

SODAR supports complex omics integration studies and can be easily installed. The software is written in Python 3 and freely available at https://github.com/bihealth/sodar-server under the MIT license.

GigaScience
Aug 21, 2023
Competing Interest Statement: The authors have declared no competing interest.

**Reviewer 2. Philippe Rocca-Serra**

The reviewer thanks the authors for their efforts in producing the submitted manuscript. The authors describe a Django-based web application designed to support data management. The tool is built to support experimental metadata capture using the ISA format in its TSV form. The tool relies on iRODS to manage data files associated with the experimental metadata. The tool offers programmatic access via an API and a clear front end.

Main comments:

The title: "SODAR: enabling, modeling, and managing multi-omics integration studies" could be clearer. Being more concise, "SODAR: standard compliant management of multi-omics studies" would deliver a better message.

Page 1, Abstract: It would benefit from further refinement as there are several repetitions. Check the 3rd sentence for English ("ranging from....to...", s/whereas/to/). Change "Scientists from diverse backgrounds also have different demands for interfacing with the data, ranging from computational users that need programmatic or command line access whereas non-computational users need graphical interfaces." to: "Scientists, with different backgrounds, ranging from computational scientists to wet-lab scientists, have different needs when it comes to data access, with programmatic interfaces being favoured by the former and graphical ones by the latter". Instead of saying "under a permissive licence", be more explicit and plainly state "under MIT licence".

Page 2, Introduction: What is the difference between "data analysis and integration of data"? There is repetition/redundancy in "An example of such complex study is (Esterhuyse et al., 2015) in infection biology, which will be used as an example below."

Suggestion: Use of the term "modeling": using "plan" or "planning" may be better to remove any ambiguity about the nature of the modelling (statistical modeling, data modeling). Alternatively, prefer 'representation' or 'representing' (the term model is repeated many times in the following sentences).

The statement "The most comprehensive standard for describing study metadata is the ISA-Tab format ..." is probably too strong. There are more formal (UML) models such as FUGE-OM (https://doi.org/10.1038/nbt1347) or CDISC SDM & SDTM. A more understated assessment would be better, such as "a popular standard, owing to its simplicity, is the ISA-Tab format", followed by "Alternatives include...". Possibly cite other options for managing such complex datasets, as seen with BIDS in neuroscience (Gorgolewski, K., Auer, T., Calhoun, V. et al. The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci Data 3, 160044 (2016). https://doi.org/10.1038/sdata.2016.44), or why not mention the HDF5 specification.

This section could be improved by refining the transitions between the different ideas presented or organising the flow. For example, by laying out the challenges of 1/ dealing with experimental metadata and 2/ dealing with digital objects produced by instruments, which have the characteristics outlined by the authors (volume, depth). Then review the technical solutions, then present the choices made by this implementation, and possibly identify the selection criteria which led to choosing one specification over another.

Results:

Page 4: "Non-computational users can interface with SODAR using the graphical UI, whereas computational users can use command line interfaces and REST APIs from scripts and other external software." This repeats the abstract. I would suggest rephrasing to 'humanise' 'computational users' vs 'non-computational users', and identifying the functions and roles in actual labs (bioinformaticians, data analysts, aka dry lab scientists) vs (experimentalists, wet-lab biologists).

Figure 1: Same comment (in fact confirmed by the choice of characters). A question about the diagram: is it the case that the Web UI does not talk to the server via the API, as done in some modern development? Probably highlight there the reliance on the Django framework.

Section 2.1: The first sentence needs attention; check the English: "for both serving for modeling experiments...". Also, there are existing systems (EBI Metabolights tools on their GitHub repo, DataVerse, FAIRdom SEEK, Zendro...). So the storytelling should probably first survey the existing systems and only then bring arguments justifying new development.

Table 1: It is odd to lump blanket statements for tools such as LIMS, ELN or 'Study Databases' without clearly stating which ones specifically have been evaluated. It seems that one could formulate a table with very different results.

Question: How was selection bias controlled for?

Page 5: This section should be reorganised and each explanatory statement refined to add clarity. Case in point: "Arbitrary Experiments": does experiment equate to 'ISA.Assay'? Is it akin to a workflow or process sequence?

Question: Among the key features that such a system should have to support the work of dry/wet lab scientists, surely deposition to public repositories should be high on the list. Why is this absent?

Page 6: Typo: s/bioinfsormaticians/bioinformaticians/. Punctuation to be checked: missing commas make for a difficult read. Suggestion: simplify the role of 'experimentalists' in the context of SODAR: "They use the templates provided by the Data Stewards to instantiate a wet lab track and track its metadata."

Question: How are data stewards trained in ISA-Tab?

Access to the demo tool gives the opportunity to use and test the component. While the UI is simple and intuitive, a number of limitations in the editing functionality make usage more difficult than it needs to be.

Page 7: "of course, using the REST-API of SODAR, it is possible to automate these tasks". Could the authors produce a Jupyter notebook showing how to do so? It would be a nice addition and possibly a good resource that could facilitate uptake.

Section 2-3, pages 8-9-10: This section could be streamlined and condensed to really focus on the interaction between shaping a sample processing & data acquisition workflow into a template which can be used by wet lab scientists, all while allowing markup with ontology terms. Note: the ontology terms on the demo server do not resolve properly.

Question: Why choose Bioportal over other services, e.g. EBI OLS?

Question: How can value sets be constrained in SODAR?

Question: Ontology browser: it is unclear whether the ontologies need to be loaded locally or whether they are accessed via an API call to the relevant services. Can the authors clarify this point? The demo server did not seem to allow it, or I wasn't able to. Maybe a figure showing the functionality would help?

Page 11, Internal Usage Statistics: Question: It seems that the mean size of an experiment stored in SODAR is ~60 samples and about 10 files per sample. These are relatively small studies. Can the authors provide insights about the performance of the platform with large studies (several thousands of samples and above)?

Methods:

Question: Installation and deployment of SODAR. Why do the authors omit to mention that SODAR can be deployed via Docker? It seems useful information.

Question: AltamISA. Checking the library, it seems that development has stalled. Is it a concern? Have the authors tested swapping AltamISA with ISA-API? Is it at all possible? Could it be done via an adaptor of some sort? Can AltamISA convert to ISA-JSON or another public-repository-compatible format, to provide a capability that helps users disseminate their results?

Comment: Figure 3 should not be supplementary material but proper content, as it is useful for showcasing the SODAR UI and customization.

Re-review: The reviewer thanks the authors for their efforts and extensive rework of the manuscript, and for delivering this software stack.

Minor corrections:

Page 4, 2nd paragraph, first sentence: typo -> s/approaching itusing/approaching it using/

Page 7, 2nd paragraph, suggested edit: change from "For publication, raw and processed data and metadata are deposited in scientific catalogues, study databases and registries. An example is the BioSamples database for metadata [22]." to "For publication, metadata and raw or processed data are deposited in scientific catalogues, study databases and registries. Examples are the BioSamples database for metadata [22] and Short Read Archive for raw sequencing data [citation needed]."

Important clarifications:

This sentence does a disservice to the manuscript: "Our work is representative of the work typically done by core units in clinics. Clinical settings often deal with humans as their primary sample source. This implies controlled access of data, or not being allowed to share confidential data. Thus, developing support for hosting data in a public repository is not our aim. Likewise, uploading data to other public repositories has not been a priority." Two reasons:
- The first is that it opens the can of worms of data governance and oversight of patient-related information. I would steer clear of that in this piece.
- The second is that I would flip the argument around: "While deposition to public repositories was not necessarily the priority, the development of an (almost, see below) ISA compliant system provides such a capability should the data owner need it".

In the results section, or in the documentation, a welcome addition would be examples of templates for non-sequencing-based assays. For instance, since the authors mentioned their need to support proteomics and mass-spectrometry users, it would make sense to highlight the templates available. In other words, it would help the target audience of the manuscript locate 'metadata profile definitions' (somewhat akin to ISA configurations) for specific assay types. If I have missed it in the manuscript or the GitHub repo, please ignore.

"Dialectic" ISA format: Several examples available from the GitHub repository generally follow the ISA-Tab specifications but also introduce a local field: "Library Name". While such a value would make sense in the official ISA specification, it is currently not supported. This leads to the creation of a diverging format. It would be sensible to keep "Library Name" as a presentation label (for display in the UI) and replace it with "Labeled Extract Name" when exporting outside the database to the tab format, in order to retain compatibility with other ISA parsers and the official specifications. It could be added as an output option to the AltamISA parser in case deposition to public repositories is needed (e.g. EMBL-Metabolights). This would go some way in helping 'Interoperability' and would not be too onerous a change. Of note, I was recently made aware that the ENA repository would be accepting submissions in ISA-Tab and ISA-JSON format, hence raising this point to the authors. Suggestion: clarify this in the Methods section.

Also, it seems the following example is missing 'Assay Name' and 'Raw Data File' fields: https://raw.githubusercontent.com/bihealth/sodar-paper/main/GSE96583_PBMC_Single-Cell_Demo_Project/a_PBMC_test_scRNAseq_nucleotide_sequencing.txt

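The reviewer's suggestion of keeping "Library Name" as a display label and writing "Labeled Extract Name" on export, together with the check for missing 'Assay Name' and 'Raw Data File' fields, could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not code from SODAR or AltamISA; the function names `export_assay_table` and `missing_columns` and the mapping `LOCAL_TO_SPEC` are hypothetical, and the sketch assumes the assay table is a tab-separated file whose first row holds the column headers.

```python
import csv
import io

# Hypothetical mapping from a locally used column header to the header
# named in the review as the ISA-Tab compatible substitute.
LOCAL_TO_SPEC = {"Library Name": "Labeled Extract Name"}


def export_assay_table(tsv_text, mapping=LOCAL_TO_SPEC):
    """Return the assay table with local column headers replaced.

    Only the header row is changed; data rows are written back as read.
    Note that field quoting is normalized by the csv module.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    rows[0] = [mapping.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


def missing_columns(tsv_text, required=("Assay Name", "Raw Data File")):
    """List the required column headers absent from the header row."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"), [])
    return [name for name in required if name not in header]


if __name__ == "__main__":
    # Made-up three-column table, used only to exercise the functions.
    sample = "Sample Name\tLibrary Name\tRaw Data File\ns1\tlib1\tf1.fastq\n"
    exported = export_assay_table(sample)
    print(exported.splitlines()[0])
    print(missing_columns(exported))
```

Doing the renaming only at export time leaves the stored sheet and the UI label untouched, which matches the reviewer's intent of keeping the local name for presentation while emitting a header that other ISA parsers recognise.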
This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad052), and the reviews have been published under the same license. These are as follows.

**Reviewer 1. Xiaotao Shen**

The authors developed the SODAR tool, which supports multi-omics integration studies. This is a great tool that has a user-friendly interface and supports multi-omics integration. However, I have several concerns that need to be addressed before this manuscript can be considered for publication. How does SODAR handle multi-omics data that come from different samples? For example, gut microbiome data from stool samples and proteomics data from blood samples, which may be from the same person but collected on different dates. Since SODAR supports cell editing, how does it keep the metadata and expression data consistent automatically? The authors claim that SODAR can support multi-omics integration studies. However, I could not find out how SODAR does that. Could the authors describe this in more detail?

Re-review: The authors have addressed all my comments and concerns.

Version published to 10.1101/2022.08.19.504516v3 on bioRxiv
May 17, 2023
Version published to 10.1101/2022.08.19.504516v2 on bioRxiv
Mar 30, 2023
Version published to 10.1101/2022.08.19.504516v1 on bioRxiv
Aug 22, 2022
