Preference-Based Fine-Tuning of Genomic Sequence Models for Personal Expression Prediction with Data Augmentation

Abstract

Despite substantial progress in genomic foundation models, accurately predicting inter-individual variation in gene expression from DNA sequence alone remains a major challenge. Current sequence-based models such as Enformer and Borzoi are trained exclusively on the reference genome and therefore cannot capture the effects of individual-specific regulatory variants. Moreover, the paired whole-genome and transcriptome data required for personalized modeling are difficult to acquire because of privacy and data-sharing constraints. To address these limitations, we integrate genomic data synthesis with established statistical frameworks. Our approach generates thousands of synthetic training samples by simulating genetic variation from the 1000 Genomes Project and assigning pseudo-expression labels with PrediXcan, a validated eQTL-based predictor. Because simulated and real expression values differ in scale and distribution, we introduce a preference-based objective that models relative rather than absolute expression patterns. Fine-tuning Enformer through alternating cycles of real-data regression and synthetic-data preference optimization enables efficient learning from both data sources. On the GEUVADIS dataset, our framework outperforms AlphaGenome, PrediXcan, and Enformer fine-tuned without synthesized data, demonstrating that simulation-based integration of population-level regulatory knowledge can mitigate data scarcity and improve cross-individual generalization in sequence-based gene expression prediction.
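The abstract does not spell out the preference objective or the alternating schedule, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes a Bradley-Terry-style pairwise logistic loss over score differences, a toy linear model standing in for the fine-tuned sequence model, and random arrays standing in for 1000 Genomes-derived sequences and PrediXcan pseudo-labels; all names and data here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(s_i, s_j):
    # Pairwise logistic (Bradley-Terry-style) loss: small when the individual
    # with the higher pseudo-expression label (i) is also scored higher.
    return -np.log(sigmoid(s_i - s_j) + 1e-12)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0])   # hypothetical ground-truth effect sizes
w = 0.1 * rng.normal(size=4)               # toy linear stand-in for the sequence model

# Real samples carry measured expression; synthetic pairs carry pseudo-labels
# whose absolute scale is untrusted, so only their ordering is used below.
X_real = rng.normal(size=(8, 4));  y_real = X_real @ true_w
X_syn  = rng.normal(size=(16, 4)); y_syn  = X_syn @ true_w

mse_before = np.mean((X_real @ w - y_real) ** 2)
lr = 0.05
for cycle in range(50):
    # (1) regression step on real expression measurements
    for x, y in zip(X_real, y_real):
        w -= lr * 2.0 * (x @ w - y) * x
    # (2) preference step on synthetic pairs: SGD on -log sigmoid(s_i - s_j),
    # whose gradient w.r.t. w is -(1 - sigmoid(s_i - s_j)) * (x_i - x_j)
    for a, b in zip(range(0, 16, 2), range(1, 16, 2)):
        i, j = (a, b) if y_syn[a] > y_syn[b] else (b, a)
        p = sigmoid(X_syn[i] @ w - X_syn[j] @ w)
        w -= lr * (-(1.0 - p) * (X_syn[i] - X_syn[j]))
mse_after = np.mean((X_real @ w - y_real) ** 2)
```

Because the preference loss depends only on score differences and label orderings, any monotone rescaling of the pseudo-labels leaves step (2) unchanged, which is the property the abstract invokes to reconcile the differing scales of simulated and real expression values.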

Availability and implementation

Code and data are available at https://github.com/pacifiic/augment-finetune-genomics.
