Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics
Abstract
Sequence-to-function models can predict gene expression from sequence data and can be used to link genetic information with transcriptomics data to understand regulatory processes and their effects on complex phenotypes. Genomic language models are pre-trained on large-scale DNA sequences and can generate robust representations of these sequences by learning the genomic context. However, few studies have estimated the predictability of gene expression levels or bridged these two classes of models to explore individualized gene expression prediction. In this manuscript, we propose UKBioBERT, a DNA language model pre-trained with genetic variants from the UK Biobank. We demonstrate that UKBioBERT generates informative embeddings capable of identifying gene functions and improving gene expression prediction in cell lines, thereby enhancing our understanding of gene expression predictability. Building upon these embeddings, we combine UKBioBERT with state-of-the-art sequence-to-function architectures, Enformer and Borzoi, to create UKBioFormer and UKBioZoi. These models exhibit better performance in predicting highly predictable gene expression levels and generalize across different cohorts. Furthermore, UKBioFormer effectively captures the relationship between genetic variants and expression variations, enabling in silico mutation analyses. Collectively, our findings underscore the value of integrating genomic language models and sequence-to-function approaches for advancing functional genomics research.
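To make the hybrid design concrete, the sketch below illustrates one plausible way a pre-trained DNA language model's embeddings could feed an Enformer-style expression head, in the spirit of UKBioFormer. This is a minimal, hypothetical illustration, not the authors' implementation: the class name `FusionHead`, all layer sizes, and the pooling scheme are assumptions for exposition.

```python
# Hypothetical sketch (not the authors' code): fusing DNA language-model
# embeddings with a sequence-to-function trunk, as in a UKBioBERT ->
# Enformer-style pipeline. All names, shapes, and layers are illustrative.
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Project LM token embeddings and predict expression tracks."""

    def __init__(self, lm_dim: int = 768, trunk_dim: int = 1536, n_tracks: int = 1):
        super().__init__()
        self.proj = nn.Linear(lm_dim, trunk_dim)      # align LM width to trunk width
        self.trunk = nn.Sequential(                   # stand-in for an Enformer-style trunk
            nn.Conv1d(trunk_dim, trunk_dim, kernel_size=5, padding=2),
            nn.GELU(),
        )
        self.head = nn.Linear(trunk_dim, n_tracks)    # expression prediction head

    def forward(self, lm_embeddings: torch.Tensor) -> torch.Tensor:
        # lm_embeddings: (batch, seq_len, lm_dim) from a pre-trained DNA LM
        x = self.proj(lm_embeddings).transpose(1, 2)  # (batch, trunk_dim, seq_len)
        x = self.trunk(x).mean(dim=-1)                # pool over sequence positions
        return self.head(x)                           # (batch, n_tracks)


# Toy usage: random tensors stand in for embeddings of a variant-aware sequence.
emb = torch.randn(2, 512, 768)
print(FusionHead()(emb).shape)  # torch.Size([2, 1])
```

Under this reading, in silico mutation analysis amounts to re-embedding a sequence with a variant substituted in and comparing the predicted expression against the reference prediction.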