Detecting and avoiding homology-based data leakage in genome-trained sequence models
Abstract
Models that predict function from DNA sequence have become critical tools for deciphering the roles of genomic sequences and genetic variation within them. However, traditional approaches for dividing genomic sequences into training data, used to create the model, and test data, used to measure the model's performance on unseen data, fail to account for the widespread homology that permeates the genome. Using models that predict human gene expression from DNA sequence, we demonstrate that model performance on test sequences varies with their similarity to training sequences, consistent with homology-based 'data leakage' that inflates performance estimates by rewarding overfitting of homologous sequences. Because a sequence and its function are inextricably linked, even a maximally overfit model with no understanding of gene regulation can predict the expression of sequences that are similar to its training data. To prevent leakage in genome-trained models, we introduce 'hashFrag,' a scalable solution for partitioning data with minimal leakage. hashFrag improves estimates of model performance and can also increase model performance by providing improved splits for model training. Altogether, we demonstrate how to account for homology-based leakage when partitioning genomic sequences for model training and evaluation, and highlight the consequences of failing to do so.
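The core idea of homology-aware partitioning can be illustrated with a minimal sketch: group sequences whose similarity exceeds a threshold, then assign entire groups to train or test so that no homologous pair straddles the split. The sketch below is a hypothetical illustration using k-mer Jaccard similarity and union-find grouping, not hashFrag's actual algorithm; the function names, the k-mer similarity measure, and the threshold are all assumptions for demonstration.

```python
from itertools import combinations


def kmers(seq, k=6):
    """Return the set of k-mers in a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def _find(parent, i):
    """Union-find root lookup with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def homology_aware_split(seqs, k=6, threshold=0.5, test_frac=0.5):
    """Split sequence indices into (train, test) so that sequences with
    k-mer Jaccard similarity >= threshold always land in the same partition.
    A toy stand-in for a proper homology-aware splitter like hashFrag."""
    sets = [kmers(s, k) for s in seqs]
    parent = list(range(len(seqs)))
    # Union any pair of sequences above the similarity threshold.
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[_find(parent, i)] = _find(parent, j)
    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(_find(parent, i), []).append(i)
    train, test = [], []
    # Assign whole homology groups, keeping similar sequences together.
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(test) < test_frac * len(seqs):
            test.extend(members)
        else:
            train.extend(members)
    return train, test
```

With a naive random split, the two near-identical sequences below could end up on opposite sides of the split, letting a memorizing model score well on the test set; grouping first prevents that.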