Streamlining Data-Intensive Biology With Workflow Systems
This article has been reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.
Author Summary
We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.
Article activity feed
A version of this preprint has been published in the Open Access journal GigaScience (see paper [insert DOI here]), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
These peer reviews were as follows:
Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102603
Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102604