Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites

Pavel Kravchenko
Ilya E. Vorontsov
Vsevolod J. Makeev
Ivan V. Kulakovskiy
Dmitry D. Penzar

Abstract

Motivation

DNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model.
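To make the independence assumption concrete: under a PWM, the score of a candidate site is the sum of one weight per position, so each nucleotide contributes separately. The sketch below illustrates this in Python. The three-position matrix, its weights, and the helper names `score_site` and `best_hit` are invented for illustration and are not taken from the article or from any motif database.

```python
# Hypothetical toy PWM: one dict of log-odds-style weights per motif position.
# All numbers are made up for illustration only.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -0.5},
    {"A": -1.0, "C": 1.5, "G": -0.5, "T": -1.0},
    {"A": -0.5, "C": -1.0, "G": 2.0, "T": -1.0},
]


def score_site(pwm, site):
    """Sum the per-position weights; each nucleotide contributes independently."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, site))


def best_hit(pwm, sequence):
    """Return (score, offset) of the highest-scoring window on the given strand."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PWM")
    return max(
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

With these toy weights, `score_site(TOY_PWM, "ACG")` is 1.0 + 1.5 + 2.0 = 4.5. Models that account for correlations between positions would add terms depending on pairs of nucleotides, which this additive form cannot express.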

Results

Here we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors from diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.
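The general idea of combining several PWMs with a classic model can be sketched as follows: score each sequence with every PWM, use the best score per PWM as a feature, and fit a simple classifier on those features. This is a minimal illustration under stated assumptions, not ArChIPelago's code or API. The two toy PWMs, the six sequences, their bound/unbound labels, and all function names are hypothetical, and a hand-written logistic regression stands in for the range of models the abstract mentions.

```python
import math


def make_pwm(consensus):
    """Toy PWM (assumption): +1.0 for the consensus base, -1.0 otherwise."""
    return [{b: (1.0 if b == c else -1.0) for b in "ACGT"} for c in consensus]


def best_score(pwm, sequence):
    """Best window score of one PWM over a sequence (forward strand only)."""
    width = len(pwm)
    return max(
        sum(col[base] for col, base in zip(pwm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    )


def featurize(pwms, sequence):
    """One feature per PWM: its best score on the sequence."""
    return [best_score(pwm, sequence) for pwm in pwms]


def predict_proba(w, b, x):
    """Logistic model: probability that a feature vector belongs to class 1."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Plain batch gradient descent on the logistic loss (illustrative only)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [predict_proba(w, b, x) - label for x, label in zip(X, y)]
        w = [
            wj - lr * sum(e * x[j] for e, x in zip(errors, X)) / n
            for j, wj in enumerate(w)
        ]
        b -= lr * sum(errors) / n
    return w, b


# Invented data: two toy PWMs and six labelled sequences.
PWMS = [make_pwm("AC"), make_pwm("GT")]
BOUND = ["ACGT", "ACTT", "AAGT"]
UNBOUND = ["CATG", "CGGA", "TAAG"]

X = [featurize(PWMS, seq) for seq in BOUND + UNBOUND]
y = [1] * len(BOUND) + [0] * len(UNBOUND)
w, b = fit_logistic(X, y)
```

On this toy set, calling `predict_proba(w, b, featurize(PWMS, seq))` ranks every sequence labelled bound above every sequence labelled unbound. A tree ensemble could be substituted for `fit_logistic` on the same feature matrix; evaluating on held-out data, as in the article's benchmarks, is outside the scope of this sketch.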

Version published to bioRxiv (10.64898/2026.05.12.724515) on May 14, 2026

Related articles

ModCRE-NN: Interpretable Deep Learning Harnesses Structural and Evolutionary Synergy to Predict Transcription Factor Binding Specificity

This article has 8 authors:
1. Victor Méndez-Riosalido
2. Patrick Gohl
3. Patricia M. Bota
4. Eric Kramer
5. Alberto Meseguer
6. Oriol Gallego
7. Narcis Fernandez-Fuentes
8. Baldo Oliva
This article has no evaluations. Latest version: May 29, 2026

Deep Learning of High-throughput Transcription Factor–DNA Binding Affinity Data: Quantitative Comparison with Pairwise-Additive Models

This article has 3 authors:
1. Ke Shen
2. Zhi Wang
3. Xiaoliang Sunney Xie
This article has no evaluations. Latest version: May 19, 2026

SLiMNet: a deep learning model to detect short linear motifs using protein large language model representations and paired inputs

This article has 2 authors:
1. Matthew C. McFee
2. Philip M. Kim
This article has no evaluations. Latest version: May 7, 2026
