SNPBag: a foundation model for multitask genome-scale SNP analysis
Abstract
Foundation models in artificial intelligence are revolutionizing biomedical research, yet single nucleotide polymorphism (SNP) data, critical for advancing biobank studies and decoding human genetic diversity, remain underexplored. Here we introduce SNPBag, a transformer-based foundation model that refines genome-scale SNP analysis. Pre-trained on one million simulated genomes with 0.8 billion parameters, it captures evolutionary signatures across the entire SNP landscape, faithfully encoding linkage disequilibrium and haplotype structures. Unlike conventional "best-practice" pipelines that chain multiple software tools, SNPBag supports versatile tasks within a single framework. For genotype imputation, the pre-trained base model matches the performance of leading algorithms and, when fine-tuned, achieves state-of-the-art (SOTA) accuracy. For haplotype phasing, it surpasses non-reference methods and rivals the best reference-based approach while delivering a 72-fold speedup. Notably, SNPBag compresses the full spectrum of six million SNPs per individual into a compact 0.75 MB embedding, enabling efficient storage, transfer and downstream applications. This embedding facilitates rapid ancestry inference across global populations and detection of genetic relationships up to the 12th degree. In sum, SNPBag establishes a scalable, self-sufficient, multitasking AI framework, poised to transform SNP data analysis and unlock the growing value of biobank resources.
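As a rough illustration of the storage figure quoted in the abstract, the following sketch works out how many embedding bits are available per SNP. It rests on assumptions that are not stated in the abstract: 1 MB is taken as 10^6 bytes, and the comparison baseline is a hypothetical 2-bit-per-genotype packed encoding (similar in spirit to the PLINK .bed layout), which may not be the baseline the authors had in mind.

```python
# Back-of-the-envelope storage arithmetic for the embedding size quoted above.
# Assumptions (not from the abstract): 1 MB = 10**6 bytes, and a hypothetical
# 2-bit-per-genotype packed baseline for comparison.

N_SNPS = 6_000_000      # SNPs per individual, as stated in the abstract
EMBEDDING_MB = 0.75     # embedding size, as stated in the abstract
BYTES_PER_MB = 10**6    # assumption: decimal megabytes


def bits_per_snp(n_snps: int, size_mb: float) -> float:
    """Average number of embedding bits available per SNP."""
    return size_mb * BYTES_PER_MB * 8 / n_snps


def packed_genotype_mb(n_snps: int, bits_per_genotype: int = 2) -> float:
    """Size in MB of a bit-packed genotype vector (hypothetical baseline)."""
    return n_snps * bits_per_genotype / 8 / BYTES_PER_MB


embedding_bits = bits_per_snp(N_SNPS, EMBEDDING_MB)
baseline_mb = packed_genotype_mb(N_SNPS)

print(f"{embedding_bits:.2f} bits per SNP in the embedding")
print(f"{baseline_mb:.2f} MB for 2-bit packed genotypes")
```

Under these assumptions the embedding amounts to about one bit per SNP, roughly half the size of a 2-bit packed genotype vector for the same six million sites.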