ModCRE-NN: Interpretable Deep Learning Harnesses Structural and Evolutionary Synergy to Predict Transcription Factor Binding Specificity

Victor Méndez-Riosalido
Patrick Gohl
Patricia M. Bota
Eric Kramer
Alberto Meseguer
Oriol Gallego
Narcis Fernandez-Fuentes
Baldo Oliva


Abstract

We present ModCRE-NN, a machine-learning framework and server for predicting transcription-factor (TF) DNA-binding motifs by integrating structural and evolutionary information. The method combines structure-derived Position Weight Matrices (PWMs) with PWMs of homologous TFs spanning multiple evolutionary sequence-identity intervals, all integrated into a unified 20-channel tensor representation. Benchmark datasets were constructed from experimental databases of TF motifs describing DNA-binding specificity, and redundancy reduction combined with strict train/test partitioning minimized homology leakage. Prediction quality was evaluated on an independent held-out set of TFs using profile-similarity analysis. Three complementary architectures were implemented and evaluated: an interpretable regression-based model, a convolutional neural network (CNN), and a Transformer-based architecture using self-attention. The regression model performed strongly in high-homology regimes dominated by closely related PWMs, whereas the CNN and Transformer architectures were more robust under low evolutionary similarity and increased structural uncertainty. Importantly, AI-generated motifs consistently improved similarity scores and reduced prediction variance relative to the original structural and evolutionary input motifs, indicating that the models effectively denoise heterogeneous motif assemblies and reconstruct stable consensus DNA-binding representations rather than simply transferring PWMs from the nearest homolog. The CNN model exhibited the most balanced attribution profile, suggesting an enhanced ability to combine weak structural and evolutionary signals into coherent motif representations.
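As a rough illustration of the multi-channel input described above, the sketch below stacks several PWMs into one tensor and compares two profiles. The abstract does not say how the 20 channels are laid out, so the split used here (one structure-derived PWM plus four homolog identity bins, each with four nucleotide rows) is an assumption, as are the motif length, the function names, and the use of column-wise Pearson correlation as a stand-in for the preprint's profile-similarity measure.

```python
import numpy as np

N_SOURCES = 5    # assumption: 1 structure-derived PWM + 4 homolog identity bins
N_BASES = 4      # A, C, G, T
MOTIF_LEN = 12   # hypothetical fixed motif length


def normalize_pwm(scores, pseudocount=1e-3):
    """Convert a (4, L) matrix of non-negative scores to column-wise probabilities."""
    scores = np.asarray(scores, dtype=float) + pseudocount
    return scores / scores.sum(axis=0, keepdims=True)


def stack_pwms(pwms):
    """Stack a list of (4, L) PWMs into one (4 * len(pwms), L) tensor."""
    return np.concatenate([normalize_pwm(p) for p in pwms], axis=0)


def column_correlation(pwm_a, pwm_b):
    """Mean Pearson correlation between matching columns of two (4, L) PWMs.

    Illustrative similarity score only; not the metric used in the preprint.
    """
    a = pwm_a - pwm_a.mean(axis=0, keepdims=True)
    b = pwm_b - pwm_b.mean(axis=0, keepdims=True)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-12
    return float(np.mean((a * b).sum(axis=0) / denom))


# Random placeholder matrices, not real motifs.
rng = np.random.default_rng(0)
sources = [rng.random((N_BASES, MOTIF_LEN)) for _ in range(N_SOURCES)]
tensor = stack_pwms(sources)
self_similarity = column_correlation(tensor[:4], tensor[:4])
```

Under these assumptions `tensor` has 20 rows (channels) and one column per motif position, which is the kind of array a CNN or Transformer could take as input.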
Additionally, we implemented a prediction-reliability framework combining Random Forest regression, exponential interpolation, and hybrid residual-corrected modeling to estimate the quality and uncertainty of the PWMs as functions of evolutionary similarity, motif-cluster consistency, and TF-family context. Overall, our results demonstrate that integrating structural information with deep learning provides a robust framework for large-scale TF-binding specificity prediction under conditions of substantial evolutionary divergence and motif uncertainty.
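The exponential-interpolation component of the reliability framework can be pictured with a minimal sketch. It assumes prediction quality is modeled as `a * exp(b * identity)` and fits the two parameters by least squares on the log scale; the functional form, the function names, and the synthetic data are all illustrative assumptions rather than details from the preprint.

```python
import numpy as np


def fit_exponential(identity, quality):
    """Fit quality ~ a * exp(b * identity) via a linear fit to log(quality)."""
    b, log_a = np.polyfit(identity, np.log(quality), deg=1)
    return float(np.exp(log_a)), float(b)


def predict_exponential(identity, a, b):
    """Evaluate the fitted exponential curve at the given identities."""
    return a * np.exp(b * np.asarray(identity, dtype=float))


# Synthetic, noise-free illustration data (not results from the preprint).
identity = np.linspace(0.2, 1.0, 9)       # fractional sequence identity
quality = 0.3 * np.exp(1.1 * identity)    # hypothetical similarity score

a, b = fit_exponential(identity, quality)
residuals = quality - predict_exponential(identity, a, b)
```

In a hybrid residual-corrected setup, `residuals` would be the target for a second model (for example a Random Forest regressor) that uses further features such as motif-cluster consistency or TF family; that step is omitted here.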

Version published to 10.64898/2026.05.27.728137 on bioRxiv
May 29, 2026

Related articles

BiLSTM-Powered Bilinear Attention for Protein–Ligand Prediction

This article has 4 authors:
1. Chih-Yang Cheng
2. Yi-An Chen
3. Feng-Yin Li
4. Suyong Re
This article has no evaluations. Latest version: May 13, 2026

Cross-Attention Over RNA And Protein Sequences Enables Generalizable Interaction Prediction

This article has 7 authors:
1. Mario Catalano
2. Gerardo Pepe
3. Gabriele Ausiello
4. Claire McWhite
5. Giorgio Gambosi
6. Manuela Helmer Citterich
7. Pier Federico Gherardini
This article has no evaluations. Latest version: Apr 23, 2026

Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites

This article has 5 authors:
1. Pavel Kravchenko
2. Ilya E. Vorontsov
3. Vsevolod J. Makeev
4. Ivan V. Kulakovskiy
5. Dmitry D. Penzar
This article has no evaluations. Latest version: May 14, 2026
