ModCRE-NN: Interpretable Deep Learning Harnesses Structural and Evolutionary Synergy to Predict Transcription Factor Binding Specificity
Abstract
We present ModCRE-NN, a machine-learning framework and server for predicting transcription-factor (TF) DNA-binding motifs by integrating structural and evolutionary information. The method combines structure-derived Position Weight Matrices (PWMs) with PWMs of homologs spanning multiple sequence-identity intervals, which are integrated into a unified 20-channel tensor representation. Benchmark datasets were built from experimental databases of TF motifs describing DNA-binding specificity, with redundancy reduction and strict train/test partitioning to minimize homology leakage. Prediction quality was evaluated on an independent held-out set of TFs using profile-similarity analysis. Three complementary architectures were implemented and evaluated: an interpretable regression-based model, a convolutional neural network (CNN), and a Transformer-based architecture using self-attention. The regression model performed strongly in high-homology regimes dominated by closely related PWMs, whereas the CNN and Transformer architectures were more robust under low evolutionary similarity and increased structural uncertainty. Importantly, AI-generated motifs consistently improved similarity scores and reduced prediction variance relative to the original structural and evolutionary input motifs, indicating that the models effectively denoise heterogeneous motif assemblies and reconstruct stable consensus DNA-binding representations rather than simply transferring PWMs from the nearest homolog. The CNN model exhibited the most balanced attribution profile, suggesting an enhanced ability to combine weak structural and evolutionary signals into coherent motif representations.
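As an illustration of how several PWMs could be stacked into a 20-channel tensor, the following is a minimal sketch and not the authors' implementation. The abstract does not state how the 20 channels are composed; the sketch assumes five PWM sources (one structure-derived PWM plus four sequence-identity intervals) with four nucleotide channels each. The names `N_SOURCES`, `MOTIF_LEN`, `normalize_pwm`, `fit_length`, and `stack_pwms` are hypothetical, as are the fixed motif length and the uniform padding.

```python
import numpy as np

N_BASES = 4      # A, C, G, T
N_SOURCES = 5    # assumption: 1 structure-derived PWM + 4 identity intervals
MOTIF_LEN = 12   # assumption: fixed motif length after trimming/padding


def normalize_pwm(counts, pseudocount=1.0):
    """Convert a (length, 4) count matrix into per-position probabilities."""
    counts = np.asarray(counts, dtype=float) + pseudocount
    return counts / counts.sum(axis=1, keepdims=True)


def fit_length(pwm, length=MOTIF_LEN):
    """Trim a PWM to `length` rows, or pad it with uniform (0.25) columns."""
    pwm = pwm[:length]
    pad = length - pwm.shape[0]
    if pad > 0:
        uniform = np.full((pad, N_BASES), 1.0 / N_BASES)
        pwm = np.vstack([pwm, uniform])
    return pwm


def stack_pwms(pwms):
    """Stack N_SOURCES PWMs into a (N_SOURCES * 4, MOTIF_LEN) tensor.

    Channel index = source_index * 4 + base_index.
    """
    if len(pwms) != N_SOURCES:
        raise ValueError(f"expected {N_SOURCES} PWMs, got {len(pwms)}")
    fixed = [fit_length(normalize_pwm(p)) for p in pwms]
    arr = np.stack(fixed)                      # (sources, length, bases)
    arr = arr.transpose(0, 2, 1)               # (sources, bases, length)
    return arr.reshape(N_SOURCES * N_BASES, MOTIF_LEN)


# Synthetic count matrices of varying length stand in for real PWMs.
rng = np.random.default_rng(0)
example_pwms = [rng.integers(1, 50, size=(n, N_BASES)) for n in (8, 12, 10, 15, 12)]
tensor = stack_pwms(example_pwms)
print(tensor.shape)
```

A channels-first layout like this is what a 1D convolution over motif positions would typically consume; a different channel composition would only change `N_SOURCES` and the stacking order.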
Additionally, we implemented a prediction-reliability framework combining Random Forest regression, exponential interpolation, and hybrid residual-corrected modeling to estimate the quality and uncertainty of the PWMs as functions of evolutionary similarity, motif-cluster consistency, and TF-family context. Overall, our results demonstrate that integrating structural information with deep learning provides a robust framework for large-scale TF-binding specificity prediction under conditions of substantial evolutionary divergence and motif uncertainty.
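The idea of a hybrid residual-corrected reliability estimate can be sketched as follows. This is an illustrative example on synthetic data, not the published model: an exponential curve is fitted to a quality score as a function of sequence identity, and the residuals are then corrected per TF family. The per-family mean residual is a deliberately simple stand-in for the Random Forest regressor named in the abstract, and the data-generating parameters, the variable names, and the helper `fit_exponential` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): quality grows with sequence identity,
# shifted by a family-specific offset plus noise.
n, n_families = 300, 3
identity = rng.uniform(0.2, 1.0, size=n)        # fractional sequence identity
family = rng.integers(0, n_families, size=n)    # hypothetical TF-family label
family_offset = np.array([-0.05, 0.0, 0.05])
quality = (0.3 * np.exp(1.0 * identity)
           + family_offset[family]
           + rng.normal(0.0, 0.02, size=n))

train, test = np.arange(200), np.arange(200, n)


def fit_exponential(x, y):
    """Fit y ~ a * exp(b * x) by linear least squares on log(y); needs y > 0."""
    b, log_a = np.polyfit(x, np.log(y), deg=1)
    return np.exp(log_a), b


# Step 1: exponential baseline as a function of sequence identity.
a, b = fit_exponential(identity[train], quality[train])
baseline_train = a * np.exp(b * identity[train])
baseline_test = a * np.exp(b * identity[test])

# Step 2: residual correction (stand-in for a Random Forest on the residuals).
residual = quality[train] - baseline_train
family_corr = np.array([residual[family[train] == f].mean()
                        for f in range(n_families)])
hybrid_test = baseline_test + family_corr[family[test]]

# Step 3: compare errors and report a crude uncertainty estimate.
rmse_base = np.sqrt(np.mean((quality[test] - baseline_test) ** 2))
rmse_hybrid = np.sqrt(np.mean((quality[test] - hybrid_test) ** 2))
uncertainty = np.std(residual - family_corr[family[train]])

print(f"baseline RMSE: {rmse_base:.3f}")
print(f"hybrid RMSE:   {rmse_hybrid:.3f}")
print(f"residual std after correction: {uncertainty:.3f}")
```

In a fuller version, the residual step would be replaced by a regressor trained on features such as motif-cluster consistency and TF-family context, which is the role the abstract assigns to the Random Forest.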