Preference-Based Fine-Tuning of Genomic Sequence Models for Personal Expression Prediction with Data Augmentation

Abstract

Despite substantial progress in genomic foundation models, accurately predicting inter-individual variation in gene expression from DNA sequence alone remains a major challenge. Current sequence-based models such as Enformer and Borzoi are trained exclusively on the reference genome and therefore cannot capture the effects of individual-specific regulatory variants. Moreover, the paired whole-genome and transcriptome data required for personalized modeling are difficult to acquire because of privacy and data-sharing constraints. To address these limitations, we integrate genomic data synthesis with established statistical frameworks. Our approach generates thousands of synthetic training samples by simulating genetic variation from the 1000 Genomes Project and assigning pseudo-expression labels with PrediXcan, a validated eQTL-based predictor. Because simulated and real expression values differ in scale and distribution, we introduce a preference-based objective that models relative rather than absolute expression patterns. Fine-tuning Enformer through alternating cycles of real-data regression and synthetic-data preference optimization enables efficient learning from both data sources. On the GEUVADIS dataset, our framework outperforms AlphaGenome, PrediXcan, and Enformer fine-tuned without synthesized data, demonstrating that simulation-based integration of population-level regulatory knowledge can mitigate data scarcity and improve cross-individual generalization in sequence-based gene expression prediction.
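The abstract does not spell out the preference objective or the alternating schedule, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes a Bradley-Terry-style pairwise logistic loss over score differences, a toy linear model standing in for the fine-tuned sequence model, and random arrays standing in for 1000 Genomes-derived sequences and PrediXcan pseudo-labels; all names and data here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(s_i, s_j):
    # Pairwise logistic (Bradley-Terry-style) loss: small when the individual
    # with the higher pseudo-expression label (i) is also scored higher.
    return -np.log(sigmoid(s_i - s_j) + 1e-12)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0])   # hypothetical ground-truth effect sizes
w = 0.1 * rng.normal(size=4)               # toy linear stand-in for the sequence model

# Real samples carry measured expression; synthetic pairs carry pseudo-labels
# whose absolute scale is untrusted, so only their ordering is used below.
X_real = rng.normal(size=(8, 4));  y_real = X_real @ true_w
X_syn  = rng.normal(size=(16, 4)); y_syn  = X_syn @ true_w

mse_before = np.mean((X_real @ w - y_real) ** 2)
lr = 0.05
for cycle in range(50):
    # (1) regression step on real expression measurements
    for x, y in zip(X_real, y_real):
        w -= lr * 2.0 * (x @ w - y) * x
    # (2) preference step on synthetic pairs: SGD on -log sigmoid(s_i - s_j),
    # whose gradient w.r.t. w is -(1 - sigmoid(s_i - s_j)) * (x_i - x_j)
    for a, b in zip(range(0, 16, 2), range(1, 16, 2)):
        i, j = (a, b) if y_syn[a] > y_syn[b] else (b, a)
        p = sigmoid(X_syn[i] @ w - X_syn[j] @ w)
        w -= lr * (-(1.0 - p) * (X_syn[i] - X_syn[j]))
mse_after = np.mean((X_real @ w - y_real) ** 2)
```

Because the preference loss depends only on score differences and label orderings, any monotone rescaling of the pseudo-labels leaves step (2) unchanged, which is the property the abstract invokes to reconcile the differing scales of simulated and real expression values.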

Availability and implementation

Code and data are available at https://github.com/pacifiic/augment-finetune-genomics.
