A scalable approach to investigating sequence-to-function predictions from personal genomes
Abstract
Sequence-to-function (S2F) models hold the promise of evaluating arbitrary DNA sequences, providing a powerful framework for linking genotype to phenotype. Yet, despite strong performance across genomic loci, these models often struggle to capture inter-individual variation in gene expression. To address this, we propose personal genome training: training models to make genotype-specific predictions at a single locus. We introduce SAGE-net, a scalable framework and software package for training and evaluating S2F models using personal genomes. Using SAGE-net, we systematically explore model architectures and training regimes, showing that personal genome training improves gene expression prediction accuracy for held-out individuals. However, performance gains arise primarily from identifying predictive variants, rather than learning a cis-regulatory grammar that generalizes across loci. This lack of generalization persists across a wide range of hyperparameters. In contrast, when applied to DNA methylation (DNAm), personal genome training enables improved generalization to unseen individuals in unseen genomic regions. This suggests that S2F models may more readily capture the sequence-level determinants of inter-individual variation in epigenomic traits. These findings highlight the need for further exploration to unlock the full potential of S2F models in decoding the regulatory grammar of personal genomes. Scalable software and infrastructure development will be critical to this progress.