General, orders-of-magnitude faster whole-genome analysis with genotype representation graphs

Drew DeHaas
Chris Adonizio
Ziqing Pan
Xinzhu Wei

Abstract

Whole-genome sequencing (WGS) of biobank-scale cohorts has generated datasets that traditional tabular genotype formats cannot efficiently store or analyze. Genotype Representation Graphs (GRGs) offer a compelling alternative: a biologically motivated, hierarchical, graph-based representation that compactly and losslessly encodes genotypes and supports computation directly on the graph rather than on a materialized genotype matrix. Here we introduce two advances that together make GRG a practical foundation for biobank-scale population and statistical genetics. First, we present GRG v2, a substantially improved format and construction algorithm that reduces construction time by 10-20×, halves the disk and RAM footprint of the resulting files, and improves load time by more than 20×. Applied to the recently phased UK Biobank WGS dataset (490,541 individuals, 706,556,181 variants), GRG v2 produces files 25 times smaller than .vcf.gz and more than 8 times smaller than PLINK2’s PGEN format, while costing less than 90 GBP to construct. Second, we introduce grapp, a Python library and command-line tool that exploits the computational advantages of GRG for both routine analyses and new method development. grapp provides standard pipelines for variant and sample filtering, genome-wide association studies (GWAS) with covariates, principal component analysis (PCA), and data export, all implemented as graph-based operations. Moreover, it provides linear operators that integrate with the numpy and scipy sparse linear algebra ecosystem, enabling implicit matrix multiplication against the standardized genotype matrix, the linkage disequilibrium matrix, and the genetic relatedness matrix, all via an underlying GRG. Using these operators, scipy-based PCA can be implemented in four lines of Python and runs 51–492× faster than existing methods while using less RAM. PCA on 89,988,512 variants in the UK Biobank runs in two to four hours. This scalability allows us to introduce a leave-one-chromosome-out (LOCO) approach to GWAS covariate construction that avoids LD artifacts without requiring LD pruning. Together, GRG v2 and grapp enable a level of scalability and methodological flexibility that is not achievable with traditional genotype formats.
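
To illustrate what "implicit matrix multiplication against the standardized genotype matrix" can look like in the scipy ecosystem, here is a minimal sketch. It does not use grapp or GRG and does not reproduce the authors' code: a small dense numpy array `G` stands in for the genotype source, and the names `matvec`, `rmatvec`, and `X_op` are hypothetical. The point is only that PCA needs products with the standardized matrix and its transpose, never the standardized matrix itself.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(0)

# Toy stand-in for a genotype source (ASSUMPTION: a real workflow would
# compute these products on a GRG instead of a dense array).
# 200 samples x 500 variants, entries in {0, 1, 2}, two sub-populations.
n_samples, n_variants = 200, 500
group = np.repeat([0, 1], n_samples // 2)
freqs = np.stack([rng.uniform(0.1, 0.5, n_variants),
                  rng.uniform(0.5, 0.9, n_variants)])
G = rng.binomial(2, freqs[group]).astype(np.float64)

# Per-variant mean and standard deviation; guard against monomorphic variants.
mean = G.mean(axis=0)
std = G.std(axis=0)
std[std == 0] = 1.0

def matvec(v):
    """Compute X @ v, where X = (G - mean) / std, without forming X."""
    w = np.asarray(v).ravel() / std
    return G @ w - mean @ w

def rmatvec(u):
    """Compute X.T @ u without forming X."""
    u = np.asarray(u).ravel()
    return (G.T @ u - mean * u.sum()) / std

# Implicit standardized genotype matrix as a scipy linear operator.
X_op = LinearOperator((n_samples, n_variants), matvec=matvec,
                      rmatvec=rmatvec, dtype=np.float64)

# Truncated SVD of the implicit operator gives the top principal components.
U, S, Vt = svds(X_op, k=5)
order = np.argsort(S)[::-1]          # sort components by decreasing singular value
U, S, Vt = U[:, order], S[order], Vt[order]
pcs = U * S                          # per-sample PC scores, shape (200, 5)
```

Once such an operator exists, the PCA itself is only the last few lines. The toy operator above still multiplies against a dense matrix; any speed or memory benefit would come from replacing those two products with graph-based ones.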

Version published to 10.64898/2026.04.10.717786 on bioRxiv
Apr 11, 2026

Related articles

Highly efficient genotype compression leveraging genealogical relatedness

This article has 4 authors:
1. Amber Shen
2. Xinran Wang
3. Nicholas Mancuso
4. Luke J. O’Connor
This article has no evaluations. Latest version: May 1, 2026

GraphPop: graph-native computation decouples population genomics complexity from sample count

This article has 5 authors:
1. Ehsan Estaji
2. Shi-Wei Zhao
3. Zhao-Yang Chen
4. Shuai Nie
5. Jian-Feng Mao
This article has no evaluations. Latest version: Apr 14, 2026

Optimizing phenotype scale improves genetic analyses in large-scale biobanks

This article has 3 authors:
1. Zhenhong Huang
2. Manuela Costantino
3. Andy Dahl
This article has no evaluations. Latest version: May 7, 2026