Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics

Gonzalo Benegas
Gokcen Eraslan
Yun S. Song

Abstract

Machine learning holds immense promise in biology, particularly for the challenging task of identifying causal variants for Mendelian and complex traits. Two primary approaches have emerged for this task: supervised sequence-to-function models trained on functional genomics experimental data and self-supervised DNA language models that learn evolutionary constraints on sequences. However, the field currently lacks consistently curated datasets with accurate labels, especially for non-coding variants, that are necessary to comprehensively benchmark these models and advance the field. In this work, we present TraitGym, a curated dataset of regulatory genetic variants that are either known to be causal or are strong candidates across 113 Mendelian and 83 complex traits, along with carefully constructed control variants. We frame the causal variant prediction task as a binary classification problem and benchmark various models, including functional-genomics-supervised models, self-supervised models, models that combine machine learning predictions with curated annotation features, and ensembles of these. Our results provide insights into the capabilities and limitations of different approaches for predicting the functional consequences of non-coding genetic variants. We find that alignment-based models CADD and GPN-MSA compare favorably for Mendelian traits and complex disease traits, while functional-genomics-supervised models Enformer and Borzoi perform better for complex non-disease traits. The benchmark, including a Google Colab notebook to evaluate a model in a few minutes, is available at https://huggingface.co/datasets/songlab/TraitGym.
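As a rough illustration of the binary classification framing described above, the sketch below scores a handful of made-up variants and summarizes the ranking with average precision. The metric choice, the function name `average_precision`, and the toy labels and scores are all assumptions made for illustration; they are not taken from the TraitGym code or data.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision-at-rank over the ranks where a positive appears.

    labels: 1 for a putative causal variant, 0 for a control variant.
    scores: model output, where higher means "more likely causal".
    Ties are broken by input order, which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank variants from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / n_pos


# Toy example (hypothetical values): two causal variants among six.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"average precision on toy data: {toy_ap:.3f}")
```

A ranking-based metric of this kind is a common choice when positives are much rarer than controls, since it rewards placing the few causal variants near the top of the list.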

Version published to 10.1101/2025.02.11.637758v1 on bioRxiv
Feb 12, 2025

Related articles

Iterative improvement of deep learning models using synthetic regulatory genomics

This article has 2 authors:
1. André M Ribeiro-dos-Santos
2. Matthew T Maurano
This article has no evaluations. Latest version: Feb 21, 2025

MutBERT: Probabilistic Genome Representation Improves Genomics Foundation Models

This article has 4 authors:
1. Weicai Long
2. Houcheng Su
3. Jiaqi Xiong
4. Yanlin Zhang
This article has no evaluations. Latest version: Jan 25, 2025

Combining Directed Evolution with Machine Learning Enables Accurate Genotype-to-Phenotype Predictions

This article has 6 authors:
1. Alexander J. Howard
2. Ellen Y. Rim
3. Oscar D. Garrett
4. Yejin Shim
5. James H. Notwell
6. Pamela C. Ronald
This article has no evaluations. Latest version: Jan 29, 2025
