General, orders-of-magnitude faster whole-genome analysis with genotype representation graphs
Abstract
Whole-genome sequencing (WGS) of biobank-scale cohorts has generated datasets that traditional tabular genotype formats cannot efficiently store or analyze. Genotype Representation Graphs (GRGs) offer a compelling alternative: a biologically motivated, hierarchical, graph-based representation that compactly and losslessly encodes the genotypes, and that supports computation directly on the graph rather than on a materialized genotype matrix. Here we introduce two advances that together make GRG a practical foundation for biobank-scale population and statistical genetics. First, we present GRG v2, a substantially improved format and construction algorithm that reduces construction time by 10–20×, halves the disk and RAM footprint of the resulting files, and improves load time by more than 20×. Applied to the recently phased UK Biobank WGS dataset (490,541 individuals, 706,556,181 variants), GRG v2 produces files 25 times smaller than .vcf.gz and more than 8 times smaller than PLINK2's PGEN format, while costing less than 90 GBP to construct. Second, we introduce grapp, a Python library and command-line tool that exploits the computational advantages of GRG for both routine analyses and new method development. grapp provides standard pipelines for variant and sample filtering, genome-wide association studies (GWAS) with covariates, principal component analysis (PCA), and data export, all implemented as graph-based operations. Moreover, it provides linear operators that integrate with the numpy and scipy sparse linear algebra ecosystem, enabling implicit matrix multiplication against the standardized genotype matrix, the linkage disequilibrium matrix, and the genetic relatedness matrix, all via an underlying GRG. Using these operators, scipy-based PCA can be implemented in four lines of Python and runs 51–492× faster than existing methods while using less RAM. PCA on 89,988,512 variants in the UK Biobank runs in two to four hours.
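To make the idea of implicit matrix multiplication concrete, the following is a minimal sketch of PCA through a scipy `LinearOperator`. It does not use grapp's API, which the abstract does not specify. The helper `standardized_operator` is hypothetical, and a small dense random matrix stands in for the GRG. In a GRG-backed operator, the two products would presumably be computed by graph traversal instead.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds


def standardized_operator(G):
    """Hypothetical helper: implicit standardized genotype matrix.

    G is a dense (samples x variants) toy stand-in for a GRG. The
    standardized matrix X = (G - mean) / std is never materialized;
    only products with X and X^T are exposed.
    """
    n, m = G.shape
    mean = G.mean(axis=0)
    std = G.std(axis=0)
    std[std == 0] = 1.0  # guard against monomorphic variants

    def matvec(v):
        # X @ v = G @ (v / std) - (mean . (v / std)) * ones
        w = np.ravel(v) / std
        return G @ w - np.full(n, mean @ w)

    def rmatvec(u):
        # X^T @ u = (G^T @ u - mean * sum(u)) / std
        u = np.ravel(u)
        return (G.T @ u - mean * u.sum()) / std

    return LinearOperator((n, m), matvec=matvec, rmatvec=rmatvec,
                          dtype=np.float64)


# Toy data: 50 samples, 200 variants, diploid genotypes in {0, 1, 2}.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(50, 200)).astype(np.float64)

# PCA itself is short once the operator exists.
X = standardized_operator(G)
U, S, Vt = svds(X, k=5)           # truncated SVD using only matvec/rmatvec
order = np.argsort(S)[::-1]       # sort components by decreasing singular value
pcs = U[:, order] * S[order]      # sample coordinates on the top 5 PCs
```

Because `svds` only ever calls `matvec` and `rmatvec`, memory use is governed by the operator's backing representation rather than by a samples-by-variants matrix of floats.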
This scalability allows us to introduce a leave-one-chromosome-out (LOCO) approach to GWAS covariate construction that avoids LD artifacts without requiring LD pruning. Together, GRG v2 and grapp enable a level of scalability and methodological flexibility that is not achievable with traditional genotype formats.
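The LOCO idea can be illustrated with a small sketch. It assumes the usual meaning of leave-one-chromosome-out: the principal components used as covariates for variants on a given chromosome are computed from variants on all other chromosomes. The abstract does not describe the authors' implementation, so the function `loco_pcs`, the dense toy matrix, and the use of a full `numpy` SVD are all illustrative assumptions rather than grapp's method.

```python
import numpy as np


def standardize(G):
    """Center and scale each variant (column) of a dense genotype matrix."""
    mean = G.mean(axis=0)
    std = G.std(axis=0)
    std[std == 0] = 1.0  # guard against monomorphic variants
    return (G - mean) / std


def loco_pcs(G, chrom, k):
    """Hypothetical helper: one set of top-k PCs per held-out chromosome.

    G     : (samples x variants) genotype matrix
    chrom : integer chromosome label for each variant
    k     : number of principal components to keep
    Returns a dict mapping chromosome -> (samples x k) PC matrix computed
    from every variant that is NOT on that chromosome.
    """
    X = standardize(G)
    pcs = {}
    for c in np.unique(chrom):
        keep = chrom != c  # drop the chromosome under test
        U, S, _ = np.linalg.svd(X[:, keep], full_matrices=False)
        pcs[int(c)] = U[:, :k] * S[:k]
    return pcs


# Toy data: 30 samples, 3 chromosomes with 40 variants each.
rng = np.random.default_rng(1)
chrom = np.repeat([1, 2, 3], 40)
G = rng.binomial(2, 0.25, size=(30, 120)).astype(np.float64)

covariates = loco_pcs(G, chrom, k=4)
# covariates[1] would be used when testing variants on chromosome 1, etc.
```

Under this construction, the covariates for a chromosome cannot be influenced by genotypes on that same chromosome. That is the property that makes LD pruning unnecessary in this sketch. The cost is one PCA per chromosome, which is why fast PCA matters for the approach.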