A scalable approach to investigating sequence-to-function predictions from personal genomes

Anna E. Spiro
Xinming Tu
Yilun Sheng
Alexander Sasse
Rezwan Hosseini
Maria Chikina
Sara Mostafavi

Abstract

Sequence-to-function (S2F) models hold the promise of evaluating arbitrary DNA sequences, providing a powerful framework for linking genotype to phenotype. Yet, despite strong performance across genomic loci, these models often struggle to capture inter-individual variation in gene expression. To address this, we propose personal genome training: training models to make genotype-specific predictions at a single locus. We introduce SAGE-net, a scalable framework and software package for training and evaluating S2F models using personal genomes. Using SAGE-net, we systematically explore model architectures and training regimes, showing that personal genome training improves gene expression prediction accuracy for held-out individuals. However, performance gains arise primarily from identifying predictive variants, rather than learning a cis-regulatory grammar that generalizes across loci. This lack of generalization persists across a wide range of hyperparameters. In contrast, when applied to DNA methylation (DNAm), personal genome training enables improved generalization to unseen individuals in unseen genomic regions. This suggests that S2F models may more readily capture the sequence-level determinants of inter-individual variation in epigenomic traits. These findings highlight the need for further exploration to unlock the full potential of S2F models in decoding the regulatory grammar of personal genomes. Scalable software and infrastructure development will be critical to this progress.
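To make the idea of genotype-specific inputs concrete, the following is a minimal sketch and not SAGE-net's actual API. It assumes that a personal sequence can be approximated by applying an individual's single-nucleotide variants to a reference window, that the model input is a one-hot encoding, and that accuracy at one locus is summarized as a Pearson correlation across held-out individuals. The function names (`apply_snvs`, `one_hot`, `pearson`) and the choice of metric are hypothetical illustrations; the abstract does not specify them.

```python
from math import sqrt
from typing import Dict, List, Sequence

BASES = "ACGT"  # assumed channel order for the one-hot encoding


def apply_snvs(reference: str, snvs: Dict[int, str]) -> str:
    """Return `reference` with SNVs applied (hypothetical helper).

    `snvs` maps 0-based positions within the window to alternate bases.
    Indels and phasing are deliberately ignored in this sketch.
    """
    seq = list(reference)
    for pos, alt in snvs.items():
        if not 0 <= pos < len(seq):
            raise ValueError(f"position {pos} is outside the window")
        if alt not in BASES:
            raise ValueError(f"unsupported alternate base {alt!r}")
        seq[pos] = alt
    return "".join(seq)


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as rows of 4 indicators (A, C, G, T).

    Any other character (for example N) becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy usage: one locus, one individual carrying a C>T variant at position 1.
reference_window = "ACGTAC"
personal_seq = apply_snvs(reference_window, {1: "T"})
encoded = one_hot(personal_seq)
```

In this framing, each training example at a locus is one individual's encoded sequence paired with that individual's measured expression, and `pearson` would be applied to predicted versus observed values across individuals who were held out of training.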

Version published to 10.1101/2025.02.21.639494 on bioRxiv
Feb 21, 2025

Related articles

GENERator: A Long-Context Generative Genomic Foundation Model

This article has 18 authors:
1. Qiuyi Li
2. Wei Wu
3. Yuanyuan Zhang
4. Zhihao Zhan
5. Ruipu Chen
6. Mingyang Li
7. Kun Fu
8. Junyan Qi
9. Yongzhou Bao
10. Chao Wang
11. Yiheng Zhu
12. Zhiyun Zhang
13. Jian Tang
14. Fuli Feng
15. Jieping Ye
16. Liu Yuwen
17. Hui Xiong
18. Zheng Wang
This article has no evaluations. Latest version: Feb 4, 2026

Understanding Pathways in Bioinformatics, Genomics, and Health Applications

This article has 1 author:
1. Diptarup Mallick
This article has no evaluations. Latest version: Jan 19, 2026

Decoding Complex Genotype-Phenotype Interactions by Discretizing the Genome

This article has 6 authors:
1. Jędrzej Kubica
2. Hetvi Jethwani
3. Krzysztof H. Banecki
4. Mauricio Moldes
5. Dariusz Plewczynski
6. Ben Busby
This article has no evaluations. Latest version: Dec 17, 2025
