Towards a universal foundation model for biobank-scale human genome variation

Abstract

Millions of human genomes have been genotyped by national biobanks worldwide. Training large language models (LLMs) on these data may lead to a universal model of the human genome with tremendous potential. Yet the quadrillions (10^15) of nucleotides, the product of genome length and population size, pose formidable challenges for modeling. In this study, we propose a novel AI framework designed to scale with these data and support diverse analytical tasks. To demonstrate this scheme, we developed SNPBag, a foundation model focusing on single nucleotide polymorphisms (SNPs). With about 1 billion parameters, it is trained on one million synthesized human genomes, corresponding to a total of 6 trillion SNP tokens. SNPBag shows superior performance across multiple benchmark tasks. In genotype imputation, it achieves state-of-the-art (SOTA) accuracy. In haplotype phasing, it rivals the best method while running 72-fold faster. By encoding the 6 million SNPs of each genome into a 0.75 MB embedding, SNPBag enables efficient storage, transfer and downstream applications. In particular, the genome embeddings facilitate rapid ancestry inference across global populations and detection of genetic relationships up to 12th-degree relatives. Collectively, SNPBag introduces a new paradigm for scalable, unified and multitask analysis of the ever-growing body of human variation data.
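
The scale figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is only a back-of-envelope illustration, not part of the authors' method. It assumes decimal megabytes (1 MB = 10^6 bytes) and an approximate haploid genome length of 3.1 billion bases; neither value is stated in the abstract, and the function names are hypothetical.

```python
# Back-of-envelope check of the scale figures quoted in the abstract.
# Assumptions (NOT from the source): 1 MB = 10**6 bytes, and a haploid
# genome length of roughly 3.1e9 bases for the "quadrillions" estimate.

SNPS_PER_GENOME = 6_000_000    # from the abstract
TRAINING_GENOMES = 1_000_000   # from the abstract
EMBEDDING_MB = 0.75            # from the abstract
BYTES_PER_MB = 10**6           # assumption: decimal megabytes
GENOME_LENGTH = 3.1e9          # assumption: approximate haploid length


def total_snp_tokens(genomes, snps_per_genome):
    """Number of SNP tokens if every genome contributes every SNP site."""
    return genomes * snps_per_genome


def embedding_bits_per_snp(embedding_mb, snps_per_genome):
    """Average embedding size, in bits, per encoded SNP."""
    return embedding_mb * BYTES_PER_MB * 8 / snps_per_genome


def total_nucleotides(genomes, genome_length=GENOME_LENGTH):
    """Genome length multiplied by population size."""
    return genomes * genome_length


if __name__ == "__main__":
    print("SNP tokens:", total_snp_tokens(TRAINING_GENOMES, SNPS_PER_GENOME))
    print("bits per SNP:", embedding_bits_per_snp(EMBEDDING_MB, SNPS_PER_GENOME))
    print("nucleotides for 1e6 genomes:", total_nucleotides(TRAINING_GENOMES))
```

Under these assumptions, one million genomes at 6 million SNPs each gives the 6 trillion tokens stated in the abstract, and a 0.75 MB embedding for 6 million SNPs works out to about one bit per SNP on average. The "quadrillions of nucleotides" figure is likewise reached once genome counts are in the millions.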
