MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Motivation
Understanding the genomic foundation of human diversity and disease requires models that effectively capture sequence variation, such as single nucleotide polymorphisms (SNPs). While recent genomic foundation models have scaled to larger datasets and multi-species inputs, they often fail to account for the sparsity and redundancy inherent in human population data, such as those in the 1000 Genomes Project. SNPs are rare in humans, and current masked language models (MLMs) trained directly on whole-genome sequences may struggle to efficiently learn these variations. Additionally, training on the entire dataset without prioritizing regions of genetic variation results in inefficiencies and negligible gains in performance.
Results
We present MutBERT, a probabilistic genome-based masked language model that efficiently utilizes SNP information from population-scale genomic data. By representing the entire genome as a probabilistic distribution over observed allele frequencies, MutBERT focuses on informative genomic variations while maintaining computational efficiency. We evaluated MutBERT against DNABERT-2, various versions of Nucleotide Transformer, and modified versions of MutBERT across multiple downstream prediction tasks. MutBERT consistently ranked as one of the top-performing models, demonstrating that this novel representation strategy enables better utilization of biobank-scale genomic data in building pretrained genomic foundation models.
Availability
https://github.com/ai4nucleome/mutBERT
Contact
yanlinzhang@hkust-gz.edu.cn