GraphPop: graph-native computation decouples population genomics complexity from sample count
Abstract
Matrix-based population genomics tools scale as O(V × N) in the number of variants V and samples N, re-reading the full genotype matrix for every analysis. Here we present GraphPop, a graph database engine that reduces summary statistic complexity to O(V × K), where K is the population count and is independent of sample count, by computing on pre-aggregated allele-count arrays stored as graph node properties. The same architecture enables annotation-conditioned queries via edge traversal, persistent analytical records, and multi-statistic composition. Applied to rice 3K (29.6M SNPs, 3,024 accessions) and human 1000 Genomes (3,202 samples, 22 autosomes), GraphPop reveals that all 12 rice subpopulations show πN/πS > 1.0, uncovers opposite consequence-level Fst regimes between species, and identifies KCNE1 as a candidate pre-Out-of-Africa sweep via convergence of five stored statistics. GraphPop achieves 146–327× query-time speedup for pre-aggregated statistics and 63–179× for bit-packed haplotype computation (iHS, XP-EHH, nSL), at constant ~160 MB procedure working memory. This complexity reduction makes systematic, annotation-integrated population genomics practical at scale.
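To illustrate why pre-aggregated allele counts remove the dependence on sample count, the following minimal Python sketch sums per-site nucleotide diversity for each population directly from (alternate allele count, total allele number) pairs. It is not GraphPop's API or implementation: the record layout, the function names `site_diversity` and `diversity_by_population`, and the choice of the standard unbiased per-site heterozygosity estimator are assumptions made for illustration. The loop touches each variant once per population, so the work is O(V × K) however many samples contributed to the counts.

```python
from typing import Dict, List, Tuple

# Hypothetical per-variant record (an assumption, not GraphPop's schema):
# population label -> (alternate allele count, total allele number).
VariantCounts = Dict[str, Tuple[int, int]]


def site_diversity(ac: int, an: int) -> float:
    """Unbiased per-site heterozygosity for a biallelic site from allele counts."""
    if an < 2:
        # Diversity is undefined with fewer than two sampled alleles; count it as 0.
        return 0.0
    return 2.0 * ac * (an - ac) / (an * (an - 1))


def diversity_by_population(variants: List[VariantCounts]) -> Dict[str, float]:
    """Sum per-site diversity for each population.

    Cost is O(V * K): one constant-time update per variant per population,
    with no access to individual genotypes.
    """
    totals: Dict[str, float] = {}
    for counts in variants:
        for pop, (ac, an) in counts.items():
            totals[pop] = totals.get(pop, 0.0) + site_diversity(ac, an)
    return totals


if __name__ == "__main__":
    # Two variants, two populations, four sampled alleles in each population.
    example = [
        {"A": (1, 4), "B": (0, 4)},
        {"A": (2, 4), "B": (4, 4)},
    ]
    print(diversity_by_population(example))
```

Dividing each total by the number of callable sites would give a per-base estimate. That denominator is left out here because the abstract does not say how it is tracked.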