Benchmarking the coding strategies of non-coding mutations on sequence-based downstream tasks with machine learning

Abstract

Non-coding single nucleotide polymorphisms (SNPs) are critical drivers of gene regulation and disease susceptibility, yet predicting their functional impact remains challenging. A variety of methods exist for encoding non-coding SNPs, such as direct base encoding or using pre-trained models to obtain embeddings. However, there is a lack of comprehensive evaluation of, and guidance on, the choice of encoding strategy for downstream prediction tasks involving non-coding SNPs. To address this gap, we present a benchmark study that compares six distinct encoding strategies for non-coding SNPs, assessing them across six dimensions, including interpretability, encoding abundance, and computational efficiency. Using three Quantitative Trait Loci (QTL)-related downstream tasks involving non-coding SNPs, we test these encoding strategies in combination with nine machine learning and deep learning models. Our findings demonstrate that semantic embeddings are notably robust, and that both the choice of encoding strategy and the model used for downstream prediction are key variables influencing task performance. This benchmark provides actionable insights into the interplay between encoding strategies, models, and data properties, offering a framework for optimizing QTL prediction tasks and advancing the analysis of non-coding SNPs in genomic regulation.
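
As an illustration of the first encoding approach the abstract names, here is a minimal sketch of direct base encoding for a SNP. The window size, helper names, and example sequences are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of "direct base encoding": one-hot encoding the
# sequence context around a SNP. The 21-bp window and the example
# sequences below are illustrative choices.
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (len(seq), 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in idx:  # ambiguous bases (e.g. N) stay all-zero
            mat[pos, idx[base]] = 1.0
    return mat

def encode_snp(ref_seq: str, alt_seq: str) -> np.ndarray:
    """Represent a SNP as the concatenated, flattened one-hot encodings
    of the reference and alternate sequence contexts."""
    return np.concatenate([one_hot(ref_seq).ravel(), one_hot(alt_seq).ravel()])

# Example: an A>G SNP with 10 bp of flanking context on each side.
ref = "ACGTACGTAC" + "A" + "GTACGTACGT"
alt = "ACGTACGTAC" + "G" + "GTACGTACGT"
x = encode_snp(ref, alt)  # shape: (2 * 21 * 4,) = (168,)
print(x.shape)
```

The resulting fixed-length vector can be fed directly to the downstream machine learning models; embedding-based strategies instead replace `encode_snp` with a pass through a pre-trained sequence model.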

Article activity feed

  1. encoding strategies

    Using mean pooling for the larger embeddings from some of the transformer-based models makes sense, as that's such a common approach. However, I was curious whether you looked into any other pooling strategies and what impact they had. Or, for a smaller embedding model, what does performance look like when the embeddings aren't reduced?
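
    For concreteness, here is a small sketch of the pooling variants the comment alludes to (mean, max, and a [CLS]-style first token) applied to per-token embeddings. The tensor shapes, attention mask, and function names are illustrative assumptions; in practice a real sequence model would supply these tensors.

    ```python
    # Contrasting pooling strategies for reducing per-token embeddings
    # to a single fixed-length vector. Shapes here are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(12, 768)).astype(np.float32)  # (seq_len, hidden_dim)
    mask = np.array([1] * 10 + [0] * 2, dtype=np.float32)   # 1 = real token, 0 = padding

    def mean_pool(h, m):
        """Average embeddings over non-padding tokens (the common default)."""
        return (h * m[:, None]).sum(axis=0) / m.sum()

    def max_pool(h, m):
        """Element-wise max over non-padding tokens."""
        return np.where(m[:, None] > 0, h, -np.inf).max(axis=0)

    def cls_pool(h):
        """Use the first ([CLS]-style) token as the sequence summary,
        assuming the model prepends such a token."""
        return h[0]

    for name, vec in [("mean", mean_pool(tokens, mask)),
                      ("max", max_pool(tokens, mask)),
                      ("cls", cls_pool(tokens))]:
        print(name, vec.shape)
    ```

    Skipping the reduction entirely, as the comment suggests, would instead hand the full (seq_len, hidden_dim) matrix to the downstream model, which only some of the benchmarked architectures can consume directly.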