Predicting dynamic expression patterns in budding yeast with a fungal DNA language model

Kuan-Hao Chao
Majed Mohamed Magzoub
Emily Stoops
Sean Hackett
Johannes Linder
David R. Kelley

Abstract

Predicting gene expression from DNA sequence remains challenging due to complex regulatory codes. We introduce a masked DNA language model (LM) pretrained on 165 fungal genomes closely related to budding yeast that captures conserved regulatory grammar. Fine-tuning the LM on yeast RNA-seq data (including high-resolution transcriptional regulator induction time courses generated in this study) yielded Shorkie, a model that substantially improves gene expression prediction compared to baselines trained without self-supervision. Shorkie identified canonical transcription factor (TF) binding motifs and tracked their usage across induction experiments. Furthermore, Shorkie accurately predicted variant effects, outperforming leading sequence-to-expression models in cis-eQTL classification and achieving high concordance with massively parallel reporter assays. Interpretability analyses revealed Shorkie’s ability to resolve promoter dynamics, splicing signals, and temporal changes in regulatory motif usage. This framework demonstrates that evolutionary-scale pretraining combined with transfer learning substantially improves our ability to decode gene regulation from sequence, providing insights into noncoding variants and regulatory networks.
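
As a rough illustration of the masked-token objective mentioned in the abstract, the sketch below hides a fraction of nucleotides in a sequence and records the original bases a model would be asked to recover. It is not the authors' implementation: the 15% mask rate, the `N` mask symbol, and the helper names `mask_sequence` and `masked_accuracy` are all assumptions chosen for illustration, and the abstract does not state how Shorkie's pretraining masks its input.

```python
import random

MASK = "N"  # hypothetical mask symbol; the abstract does not specify one


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of positions and return (masked_seq, targets).

    targets maps each masked position to its original base.
    The 15% default is an assumed rate, not a value from the preprint.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]
        masked[pos] = MASK
    return "".join(masked), targets


def masked_accuracy(predictions, targets):
    """Fraction of masked positions whose predicted base matches the original."""
    correct = sum(predictions.get(pos) == base for pos, base in targets.items())
    return correct / len(targets)


seq = "ATGCGTACGTTAGCCTAGGA"
masked, targets = mask_sequence(seq)
print(masked)
print(targets)
```

A trained model would supply the `predictions` dictionary (position to predicted base); scoring only the masked positions is what makes the objective self-supervised, since the labels come from the sequence itself.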

Version published to 10.1101/2025.09.19.677475 on bioRxiv
Sep 21, 2025



Related articles

Protein Language Models Rescue Variant Pathogenicity Prediction in Intrinsically Disordered Regions Through Synergistic Integration with Structure-Based Methods

GENERator: A Long-Context Generative Genomic Foundation Model

Understanding Pathways in Bioinformatics, Genomics, and Health Applications