Benchmarking the coding strategies of non-coding mutations on sequence-based downstream tasks with machine learning

Abstract

Non-coding single nucleotide polymorphisms (SNPs) are critical drivers of gene regulation and disease susceptibility, yet predicting their functional impact remains challenging. A variety of methods exist for encoding non-coding SNPs, such as direct base encoding or using pre-trained models to obtain embeddings. However, there is a lack of comprehensive evaluation of, and guidance on, the choice of encoding strategy for downstream prediction tasks involving non-coding SNPs. To address this gap, we present a benchmark study that compares six distinct encoding strategies for non-coding SNPs, assessing them across six dimensions, including interpretability, encoding abundance, and computational efficiency. Using three Quantitative Trait Loci (QTL)-related downstream tasks involving non-coding SNPs, we test these encoding strategies in combination with nine machine learning and deep learning models. Our findings demonstrate that semantic embeddings are notably robust, and that both the choice of encoding strategy and the model used for downstream prediction are key variables influencing task performance. This benchmark provides actionable insights into the interplay between encoding strategies, models, and data properties, offering a framework for optimizing QTL prediction tasks and advancing the analysis of non-coding SNPs in genomic regulation.
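
As an illustration of the first encoding approach the abstract names, here is a minimal sketch of direct base encoding for a SNP. The window size, helper names, and example sequences are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of "direct base encoding": one-hot encoding the
# sequence context around a SNP. The 21-bp window and the example
# sequences below are illustrative choices.
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (len(seq), 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in idx:  # ambiguous bases (e.g. N) stay all-zero
            mat[pos, idx[base]] = 1.0
    return mat

def encode_snp(ref_seq: str, alt_seq: str) -> np.ndarray:
    """Represent a SNP as the concatenated, flattened one-hot encodings
    of the reference and alternate sequence contexts."""
    return np.concatenate([one_hot(ref_seq).ravel(), one_hot(alt_seq).ravel()])

# Example: an A>G SNP with 10 bp of flanking context on each side.
ref = "ACGTACGTAC" + "A" + "GTACGTACGT"
alt = "ACGTACGTAC" + "G" + "GTACGTACGT"
x = encode_snp(ref, alt)  # shape: (2 * 21 * 4,) = (168,)
print(x.shape)
```

The resulting fixed-length vector can be fed directly to the downstream machine learning models; embedding-based strategies instead replace `encode_snp` with a pass through a pre-trained sequence model.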

Article activity feed

  1. encoding strategies

    Using mean pooling for the larger embeddings from some of the transformer-based models makes sense, as that's such a common approach. However, I was curious whether you looked into any other pooling strategies and what impact they had. Or, for a smaller embedding model, what does performance look like when the embeddings aren't reduced?
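
    For concreteness, here is a small sketch of the pooling variants the comment alludes to (mean, max, and a [CLS]-style first token) applied to per-token embeddings. The tensor shapes, attention mask, and function names are illustrative assumptions; in practice a real sequence model would supply these tensors.

    ```python
    # Contrasting pooling strategies for reducing per-token embeddings
    # to a single fixed-length vector. Shapes here are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(12, 768)).astype(np.float32)  # (seq_len, hidden_dim)
    mask = np.array([1] * 10 + [0] * 2, dtype=np.float32)   # 1 = real token, 0 = padding

    def mean_pool(h, m):
        """Average embeddings over non-padding tokens (the common default)."""
        return (h * m[:, None]).sum(axis=0) / m.sum()

    def max_pool(h, m):
        """Element-wise max over non-padding tokens."""
        return np.where(m[:, None] > 0, h, -np.inf).max(axis=0)

    def cls_pool(h):
        """Use the first ([CLS]-style) token as the sequence summary,
        assuming the model prepends such a token."""
        return h[0]

    for name, vec in [("mean", mean_pool(tokens, mask)),
                      ("max", max_pool(tokens, mask)),
                      ("cls", cls_pool(tokens))]:
        print(name, vec.shape)
    ```

    Skipping the reduction entirely, as the comment suggests, would instead hand the full (seq_len, hidden_dim) matrix to the downstream model, which only some of the benchmarked architectures can consume directly.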