A SNP Foundation Model: Application in Whole-Genome Haplotype Phasing and Genotype Imputation

Abstract

Artificial intelligence foundation models are revolutionizing biomedical research. Human genetic diversity data, predominantly in the form of single nucleotide polymorphisms (SNPs), constitutes the largest and fastest-growing portion of biobank databases. However, SNP data has been a missing piece in the Bio-AI domain. Here, we present SNPBag, the first SNP foundation model for genome-scale SNP analyses. Built on a large language model (LLM), SNPBag aims to address haplotype phasing and genotype imputation simultaneously. Using data from the 1000 Genomes Project, we pre-trained SNPBag on overlapping sequences by masking genotypes. For the haplotype phasing task, our model achieved a state-of-the-art (SOTA) switch error rate of 1%, outperforming BEAGLE5.2, EAGLE2, and SHAPEIT4. In genotype imputation for the Illumina Omni2.5 SNP array, the model reached an accuracy of 96.88%, comparable to the best available methods. Additionally, when fine-tuned for ancestry inference, the model achieved 97% accuracy in predicting the super-populations of 72 individuals from the 1000 Genomes Project. This study demonstrates the great potential of SNPBag in representing and inferring global human genetic diversity. Future research will focus on training SNPBag with large-scale Genome-Wide Association Study (GWAS) data to predict phenotypes from whole-genome genotypes, opening new avenues for understanding human genetics and disease.
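
For readers unfamiliar with the phasing metric quoted in the abstract, the following is a minimal sketch of how a switch error rate is commonly computed for one individual. It is not the authors' evaluation code. The function name `switch_error_rate` and its list-based inputs are illustrative assumptions, and the sketch assumes biallelic sites coded 0/1 and predicted genotypes that agree with the true genotypes.

```python
def switch_error_rate(true_hap1, true_hap2, pred_hap1, pred_hap2):
    """Fraction of consecutive heterozygous-site pairs phased inconsistently.

    Hypothetical helper for illustration: each argument is a sequence of
    0/1 alleles for one haplotype, aligned by site.
    """
    orientations = []
    for t1, t2, p1, p2 in zip(true_hap1, true_hap2, pred_hap1, pred_hap2):
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        if {p1, p2} != {t1, t2}:
            raise ValueError("predicted genotype differs from true genotype")
        # True if predicted haplotype 1 matches true haplotype 1 at this site.
        orientations.append(p1 == t1)

    if len(orientations) < 2:
        return 0.0  # fewer than two heterozygous sites: no switch is possible

    # A switch error is a change of orientation between adjacent het sites.
    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)


if __name__ == "__main__":
    truth1 = [0, 1, 0, 1, 1, 0]
    truth2 = [1, 0, 0, 0, 0, 1]
    pred1 = [0, 1, 0, 0, 0, 1]
    pred2 = [1, 0, 0, 1, 1, 0]
    # One switch among four adjacent heterozygous pairs.
    print(switch_error_rate(truth1, truth2, pred1, pred2))
```

Because the metric depends only on the relative phase of adjacent heterozygous sites, swapping the two predicted haplotypes along their whole length leaves the rate unchanged.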
