Integrating Protein and DNA Embeddings for Improving Genome-Wide Transcription Factor Binding Site Prediction

Shreya Basnet
Jianlin Cheng

Abstract

Transcription factors (TFs) regulate gene expression by binding to specific DNA sites on the genome, making accurate TF binding site prediction critical for understanding gene regulation and downstream phenotypes. Current deep learning methods use only DNA-related information to predict TF binding sites, ignoring the fact that TFs with different protein sequences and structures recognize distinct DNA patterns. Omitting TF information not only limits prediction accuracy but also prevents these methods from generalizing to new TFs that are absent from the training data. Here, we present TransBind, a protein-aware deep learning architecture that integrates DNA sequence information with protein embeddings, which contain both sequence and structural information and are derived from a protein language model pretrained on DNA-binding proteins, to improve TF binding site prediction. Through cross-attention, a TF embedding selectively attends to genomic regions according to its unique binding properties. Evaluated on data from 690 ChIP-seq experiments spanning 161 TFs across 91 human cell types, TransBind achieves an AUROC of 0.950 and an AUPR of 0.371, a ≥11.3% relative AUPR improvement over state-of-the-art methods including TBiNet, DanQ, and DeepSEA. The model outperformed existing methods in ≥97.1% of TF–cell type combinations. It also recovered 160 known TF binding motifs in the JASPAR database, supporting the biological interpretability of the model. Moreover, the approach enables zero-shot prediction for unseen TFs, demonstrating its potential to generalize to new, poorly characterized TFs. The source code of TransBind is available at https://github.com/jianlin-cheng/TransBind.
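
The cross-attention idea in the abstract can be illustrated with a minimal sketch. This is not the TransBind implementation (see the repository linked above for that); it assumes a single attention head, a single TF embedding used as the query, and per-position DNA feature vectors used as keys and values. All dimensions, the random projection matrices, and the function names are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cross_attention(tf_embedding, dna_features, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    The TF embedding is projected to a query; each DNA position is projected
    to a key and a value. Returns the attended context vector and the
    attention weight assigned to each DNA position.
    """
    query = matvec(w_q, tf_embedding)
    keys = [matvec(w_k, feat) for feat in dna_features]
    values = [matvec(w_v, feat) for feat in dna_features]

    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)

    # Weighted sum of the value vectors, one output entry per attention dim.
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned projection weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: TF embedding dim, DNA feature dim, attention dim, positions.
rng = random.Random(0)
d_tf, d_dna, d_attn, n_pos = 8, 6, 4, 10

tf_embedding = [rng.gauss(0.0, 1.0) for _ in range(d_tf)]
dna_features = random_matrix(n_pos, d_dna, rng)  # one feature vector per position
w_q = random_matrix(d_attn, d_tf, rng)
w_k = random_matrix(d_attn, d_dna, rng)
w_v = random_matrix(d_attn, d_dna, rng)

context, weights = cross_attention(tf_embedding, dna_features, w_q, w_k, w_v)
print("attention weights over DNA positions:", [round(w, 3) for w in weights])
```

Because the query comes from the TF embedding rather than from the DNA itself, two different TFs produce different attention weights over the same DNA region, which is the property the abstract relies on for TF-specific and zero-shot prediction.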

Version published to 10.1101/2025.09.15.676319 on bioRxiv
Sep 17, 2025

Related articles

Multimodal learning decodes the global binding landscape of chromatin-associated proteins

This article has 10 authors:
1. Jimin Tan
2. Xi Fu
3. Xinyu Ling
4. Shentong Mo
5. Jiangshan Bai
6. Raúl Rabadán
7. David Fenyö
8. Jef D. Boeke
9. Aristotelis Tsirigos
10. Bo Xia
This article has no evaluations
Latest version Aug 17, 2025

BindPred: A Framework for Predicting Protein-Protein Binding Affinity from Language Model Embeddings

This article has 4 authors:
1. Haixing Piao
2. Veda Sheersh Boorla
3. Somtirtha Santra
4. Costas D. Maranas
This article has no evaluations
Latest version Sep 29, 2025

Enhancing Protein Binding Site Residue Prediction with Graph Neural Networks: Impacts of Cutoff Distance and Feature Selection

This article has 3 authors:
1. Serena H. Chen
2. Massimiliano Lupo Pasini
3. Cory D. Hauck
This article has no evaluations
Latest version Aug 26, 2025
