Streamlining Large-Scale Genomic Data Management: Insights from the UK Biobank Whole-Genome Sequencing Data

Abstract

Biobank-scale whole-genome sequencing (WGS) studies are increasingly pivotal in unraveling the genetic bases of diverse health outcomes. However, the sheer volume and complexity of these datasets make them difficult to manage and analyze. We propose vcf2agds, an all-in-one toolkit that efficiently converts WGS data from Variant Call Format (VCF) to the annotated Genomic Data Structure (aGDS) format, substantially reducing data size while supporting the integration of genomic and functional data for comprehensive genetic analyses. We applied the toolkit to the UK Biobank 500k WGS data, producing twenty-three aGDS files, one per chromosome, which together compress 1,473.85 tebibytes of pVCF data into 1.10 tebibytes. Using these aGDS files, we conducted a functionally informed rare variant association analysis of total cholesterol with the STAARpipeline and detected 480 genome-wide significant coding and noncoding associations. Overall, vcf2agds offers a streamlined approach to the efficient management and analysis of biobank-scale WGS data across hundreds of thousands of samples.
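
To put the reported storage figures in perspective, the short Python sketch below works out the compression implied by the two sizes given in the abstract. It is only arithmetic on those published numbers and is not part of vcf2agds; the function name `compression_summary` is made up for illustration, and the per-file average assumes, purely for illustration, that the 1.10 tebibytes are spread evenly over the twenty-three files, whereas real per-chromosome sizes will differ.

```python
# Figures taken from the abstract (sizes in tebibytes, TiB).
PVCF_TIB = 1473.85   # total size of the pVCF input
AGDS_TIB = 1.10      # total size of the resulting aGDS files
N_FILES = 23         # one aGDS file per chromosome


def compression_summary(before_tib, after_tib):
    """Return (fold reduction, percent of storage saved).

    Hypothetical helper for illustration only; not a vcf2agds function.
    """
    fold = before_tib / after_tib
    percent_saved = (1 - after_tib / before_tib) * 100
    return fold, percent_saved


fold, percent_saved = compression_summary(PVCF_TIB, AGDS_TIB)

# Illustrative only: assumes equal-sized files, which real chromosomes are not.
avg_file_gib = AGDS_TIB * 1024 / N_FILES

print(f"Fold reduction: {fold:.0f}x")
print(f"Storage saved:  {percent_saved:.2f}%")
print(f"Average aGDS file size (even split): {avg_file_gib:.1f} GiB")
```

On these numbers the aGDS files take roughly 1,340 times less space than the pVCF input, a saving of more than 99.9% of the original storage.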