Spark4VCF: A Novel Big Data Framework to Accelerate Genomics Analysis
Abstract
In recent years, the exponential growth of next-generation sequencing (NGS) has led to an unprecedented increase in the amount of genomics data. While NGS technologies enable us to read the entire human genome, functional analysis of the variants found in human sequences and phenotype prediction are still limited by computational tools that typically incur high computing overhead because of the gigabytes or terabytes of data to be analyzed. Here we report a powerful big data framework called Spark4VCF, which uses the Apache Spark engine to accelerate genomics pipelines. Spark4VCF exploits the independence of variants and samples from one another to speed up commonly used computational tools through parallel computing, while maintaining output quality and optimizing I/O tasks. We illustrate the superior speed, CPU usage, and memory usage of Spark4VCF, as well as its new capabilities, with example applications of three popular genomics toolboxes: GATK, VEP, and PyPGx. In summary, Spark4VCF is a high-performance framework that provides not only the capacity to analyze large volumes of genomics data but also user-friendly applications in big data settings.
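The idea of exploiting independence between variants can be illustrated with a minimal sketch. This is not Spark4VCF's implementation: it swaps Spark for a thread pool from the Python standard library so that it runs anywhere, and the toy VCF records, the chunk size, and the `annotate_chunk` step (a hypothetical stand-in for a per-variant tool such as an annotator) are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Toy VCF content; the records are made up for illustration only.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t10\tPASS\t.\n"
    "chr2\t150\t.\tG\tGA\t99\tPASS\t.\n"
    "chr2\t300\t.\tT\tC\t35\tPASS\t.\n"
)


def split_vcf(text):
    """Separate header lines (starting with '#') from variant records."""
    header, records = [], []
    for line in text.splitlines():
        if not line:
            continue
        (header if line.startswith("#") else records).append(line)
    return header, records


def chunk(records, size):
    """Yield consecutive blocks of at most `size` records."""
    it = iter(records)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def annotate_chunk(block):
    """Hypothetical per-variant step: label each record as SNV or INDEL.

    Each record is handled without looking at any other record, which is
    what makes the chunks safe to process in parallel.
    """
    out = []
    for line in block:
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]
        kind = "SNV" if len(ref) == 1 and len(alt) == 1 else "INDEL"
        out.append((fields[0], int(fields[1]), kind))
    return out


def run_parallel(text, chunk_size=2, workers=2):
    """Process record chunks concurrently and merge results in input order."""
    _, records = split_vcf(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order the chunks were submitted.
        results = pool.map(annotate_chunk, chunk(records, chunk_size))
        return [row for block in results for row in block]


annotated = run_parallel(VCF_TEXT)
for row in annotated:
    print(row)
```

Because no record depends on another, the merged result is the same for any chunk size or worker count. In a distributed engine, the chunks would correspond to data partitions rather than thread tasks.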