A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object


Abstract

Background. Applying good data management and FAIR data principles (Findable, Accessible, Interoperable, and Reusable) in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object.

Findings. We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub.

Conclusions. Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.

Article activity feed

  1. **Reviewer 3 Megan Hagenauer** - Original Submission

    Review of "A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object" by Niehues et al. for GigaScience, 08-31-2023.

    I want to begin by apologizing for the tardiness of this review - my whole family caught Covid during the review period, and it has taken several weeks for us to be functional again.

    Overview

    As a genomics data analyst, I found this manuscript to be a fascinating, inspiring, and, quite honestly, intimidating view into the process of making analysis code and workflow truly meet FAIR standards. I have added recommendations below for elements to add to the manuscript that would help myself and other analysts use your case study to plan out our own workflows and code release. These recommendations fall quite solidly into the "Minor Revision" category and may require some editorial oversight as this article type is new to me. Please note that I only had access to the main text of the manuscript while writing this review.

    Specific Comments

    1) As a case study, it would be useful to have more explicit discussion of the expertise and effort involved in the FAIR code release and the anticipated cost/benefit ratio. As a data analyst, I have a deep, vested interest in reproducible science and improved workflow/code reusability, but also limited bandwidth. For me, your overview of the process of producing a FAIR code release was both inspiring and daunting, and left me with many questions about the feasibility of following in your footsteps. The value of your case study would be greatly enhanced by discussing cost/benefit in more detail:

    a. What sort of expertise or training was required to complete each step in the FAIR release? E.g.,

    i. Was your use of tools like GitHub, Jupyter notebook, WorkflowHub, and DockerHub something that could be completed by a scientist with introductory training in these tools, or did it require higher-level use?

    ii. Was there any particular training required for the production of high-quality user documentation or metadata? (e.g., navigating ontologies?)

    b. With this expertise/training in place, how much time and effort do you estimate that it took to complete each step of adapting your analysis workflow and code release to meet FAIR standards?

    i. Do you think this time and effort would differ if an analyst planned to meet FAIR standards for analysis code prior to initiating the analysis versus deciding post hoc to make the release of previously created code fit FAIR standards?

    c. The introduction provides an excellent overview of the potential benefits of releasing FAIR analysis code/workflows. How did these benefits end up playing out within your specific case study?

    i. E.g., I thought this sentence in your discussion was a particularly important note about the benefits of FAIR analysis code in your study: "Developing workflows with partners across multiple institutions can pose a challenge and we experienced that a secure shared computing environment was key to the success of this project."

    ii. Has the FAIR analysis workflow also been useful for collaboration or training in your lab?

    iii. How many of the analysis modules (or other aspects of the pipeline) do you plan on reusing? In general, what do you think is the size of the audience for reuse of the FAIR code? (e.g., how many people do you think will have been saved significant amounts of work by you putting in this effort?)

    iv. … Or is the primary benefit mostly just improving the transparency/reproducibility of your science?

    d. If there is any way to easily overview these aspects of your case study (effort/time, expertise, immediate benefits) in a table or figure, that would be ideal. This is definitely the content that I would be skimming your paper to find.

    2) As a reusable code workflow, it would be useful to provide additional information about the data input and experimental design, so that readers can determine how easily the workflow could be adapted to their own datasets. This information could be added to the text or to Fig 1. E.g.,

    i. The dimensionality of the input (sample size, number of independent variables & potential co-variates, number of dependent variables in each dataset, etc.)

    ii. Data types for the independent variables, co-variates, and dependent variables (e.g., categorical, numeric, etc.)

    iii. Any collinearity between independent variables (e.g., nesting, confounding).

    3) As documentation of the analysis, it would be useful to provide additional information about how the analysis workflow may influence the interpretation of the results.

    a. It would be especially useful to know which aspects of the analysis were preplanned or followed a standard procedure/protocol, and which aspects of the analysis were customized after reviewing the data or results. This information can help the reader assess the risk of overfitting or HARKing.

    b. It would also be useful to call out explicitly how certain analysis decisions change the interpretation of the results. In particular, the decision to use dimension reduction techniques within the analysis of both the independent and dependent variables, and then focus only on the top dimensions explaining the largest sources of variation within the datasets, is especially important to justify, and to describe its impact on the interpretation of the results. Is there reason to believe that externalizing behavior should be related to the largest sources of variation within buccal DNA methylation or urinary metabolites? Within genetic analyses, the assumption tends to be the opposite - that genetic variation related to behavior (such as externalizing) is likely to be present in a small percent of the genome, and that the top sources of variation within the genetics dataset are uninteresting (related to population) and therefore traditionally filtered out of the data prior to analysis. Within transcriptomics, if a tissue is involved in generating the behavior, some of the top dimensions explaining the largest sources of variation in the dataset may be related to that behavior, but the absolute largest sources of variation are almost always technical artifacts (e.g., processing batches, dissection batches) or impactful sources of biological noise (e.g., age, sex, cell type heterogeneity in the tissue). Is there reason to believe that cheek cells would have their main sources of epigenetic variation strongly related to externalizing behavior? (Maybe as a canary in a coal mine for other whole-organism events like developmental stress exposure?) Is there reason to believe that the primary variation in urinary metabolites would be related to externalizing behavior? (Perhaps as a stand-in for other large-scale organismal states that might be related to the behavior - hormonal states? metabolic states? inflammation?) Since the goal of this paper is to provide a case study for creating a FAIR data analysis workflow, it is less important that you have strong answers to these questions, and more important that you are transparent about how the answers to these questions change the interpretation of your results. Adding a few sentences to the discussion is probably sufficient to serve this purpose.

    Thank you for your hard work helping advance our field towards greater transparency and reproducibility. I look forward to seeing your paper published so that I can share it with the other analysts in our lab.

  2. **Reviewer 2 Dominique Batista** - Original Submission

    Very good paper on the FAIR side. You detail what the challenges were, in particular when it comes to the selection of ontologies and terms.

    It is unclear if the generation of the ISA metadata is included in the workflow. Can a user generate the metadata for the synthetic dataset or their own data using the workflow?

    Adding a GitHub Action running the workflow with the synthetic data would help reusability but is not required for the publication of the paper.
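    For illustration, the GitHub Action the reviewer suggests could look roughly like the sketch below. This is a hypothetical CI configuration, not part of the authors' repository: the file name, the `main.nf` entry point, the `docker` profile, and the `--input` parameter are all assumptions and would need to match the actual pipeline.

    ```yaml
    # Hypothetical .github/workflows/run-synthetic.yml
    # Sketch of a CI job that runs the Nextflow pipeline on the bundled synthetic data.
    # Entry point, profile, and parameter names are assumptions.
    name: run-synthetic-data
    on: [push, pull_request]
    jobs:
      synthetic-run:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: nf-core/setup-nextflow@v2   # installs the Nextflow CLI
          - name: Run workflow on the synthetic dataset
            run: nextflow run main.nf -profile docker --input synthetic_data/
    ```

    Running the workflow on synthetic data in CI would catch broken container references or parameter changes on every push, which directly supports the reusability the reviewer asks about.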

  3. This work has been published in *GigaScience* journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad115), and the reviews are published under the same license. These are as follows.

    **Reviewer 1 Carole Goble** - Original Submission

    This work reports a multi-omics data analysis workflow packaged as an RO-Crate, an implementation of a FAIR Digital Object. We limit our comments to the technical aspects of the Research Object and workflow packaging. The scientific validity of the omics analysis itself is outside our expertise.

    The paper is comprehensive and the background grounding in the current state of the art is excellent and thorough. The paper is an excellent exemplar of the future of data analysis reporting for FAIR and reproducible computational methods, and the amount of work is impressive. We congratulate the authors.

    The WorkflowHub entry https://workflowhub.eu/workflows/402?version=5# gives a comprehensive report of the Nextflow workflow and its multiple versions, and all the files including the R scripts and the synthetic data. The RO-Crate rendering looks correct, and version-locking the R containers follows best practice (https://github.com/Xomics/ACTIONdemonstrator_workflow/blob/main/nextflow.config#L44).

    The paper also highlights the amount of work needed to make such a pipeline both metadata machine processable and metadata human readable. Making this pipeline reproducible requires a mixture of notebooks submitted as supplementary materials, the Nextflow workflow with its R scripts represented as an RO-Crate in WorkflowHub, a README linked to the container recipes in https://github.com/Xomics/Docker_containers, and then another Documentation.md file. There seems to be the potential for duplicated effort in reporting the necessary metadata describing the workflow, which could be highlighted in the Discussion as a steer to the digital object community.

    - Could the RO-Crate approach be widened beyond the current Workflow RO-Crate, and would there be value in streamlining the metadata, or is this just an artefact of the need for multiple descriptions and ease of publishing? If the JSON within the RO-Crate was more richly annotated, could some of the Documentation.md be avoided altogether, and is that even desirable?

    - The README includes the container/software packaging and is not linked from the RO-Crate (and there isn't an obvious property to link to it yet). Could these be RO-Crates too?

    - The notebooks in the supplementary files could also be registered in WorkflowHub and linked to the Nextflow workflow (see https://workflowhub.eu/workflows?filter%5Bworkflow_type%5D=jupyter).

    - Is it feasible and desirable to have a single RO-Crate linked to many other RO-Crates to represent the whole reproducible pipeline in full?

    In the discussion, the verification of the FAIR principles through different practices and approaches would be more helpful if it were more precise. Comments seem to be limited to the Workflow RO-Crate and the use of ontologies for machine readability. As highlighted in Table 1, there is more to FAIR software & workflows than this.

    Minor remarks

    Key points
    - "We here demonstrate the implementation multiomics data" -> "We here demonstrate an implementation of a multi-omics data".

    Background
    - The documentation of dependencies is highlighted as a prerequisite for software interoperability. In the FAIR4RS principles, I2 also highlights qualified references to other objects - presumably other software or installation requirements. This highlights the relationship between software interoperability and software portability. It seems that dependencies relate more to portability than to interoperability.
    - "Based on the FDO concept, the RO-Crate approach was specified". This is a confusing statement. RO-Crates have been recognised as an implementation approach for the FDO concept as proposed by the FDO Forum. For more discussion on FDO and the Linked Data approach promoted by RO-Crates see https://arxiv.org/abs/2306.07436. However, RO-Crates are not based on the FDO - they are based on the Research Object packaging work that emerged from the EU Wf4ever project (see https://doi.org/10.1016/j.future.2011.08.004 from 2013).
    - It is better to describe the RO-Crate metadata file as "It contains all contextual and non-contextual related data to re-run the workflow" instead of "It can additionally contain data on which the workflow can be run."

    Workflow Implementation
    - At the beginning of the last paragraph, replace "Besides the workflow and the synthetic data set" with "As well as the workflow and the synthetic data set".
    - https://workflowhub.eu/workflows/402?version=5# gives a very nice pictorial overview of the workflow that you may consider including in the paper itself.
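    To illustrate the point about linking external documentation from the crate, one possible shape for such a reference in `ro-crate-metadata.json` is sketched below. This is a hedged sketch, not the authors' actual crate: the use of the schema.org `subjectOf` property and the entity descriptions are assumptions about how such a link might be expressed.

    ```json
    {
      "@context": "https://w3id.org/ro/crate/1.1/context",
      "@graph": [
        {
          "@id": "ro-crate-metadata.json",
          "@type": "CreativeWork",
          "about": { "@id": "./" },
          "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" }
        },
        {
          "@id": "./",
          "@type": "Dataset",
          "subjectOf": [
            { "@id": "https://github.com/Xomics/Docker_containers" }
          ]
        },
        {
          "@id": "https://github.com/Xomics/Docker_containers",
          "@type": "CreativeWork",
          "name": "Container recipes and README",
          "description": "External repository holding the container build recipes referenced by the workflow documentation."
        }
      ]
    }
    ```

    Whether a generic schema.org property like `subjectOf` is the right fit, or whether a dedicated profile term is needed, is exactly the open question the reviewer raises.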