Exploring Protein-DNA Binding Residue Prediction and Consistent Interpretability Analysis Using Deep Learning
Abstract
Accurately identifying DNA-binding residues is a crucial step in developing computational tools to model DNA-protein binding properties, which is essential for binding pocket discovery and related drug design. Although several tools have been developed to predict DNA-binding residues based on protein sequences and structures, their performance remains limited, and proteins with crystal structures still represent only a small fraction of DNA-binding proteins. Additionally, the process of extracting handcrafted features for protein representation is labor-intensive. In this study, we combined the strengths of pre-trained protein language models and attention mechanisms to propose a sequence-based method: an attention-based deep learning approach for accurately predicting DNA-binding residues, incorporating a contrastive learning module. Our method outperformed all other sequence-based models across two prevalent benchmark datasets. Furthermore, we developed a structure-based graph neural network (GNN) model to demonstrate the impact of the contrastive module. A common limitation of existing models is their lack of interpretability, which hinders our ability to understand what these models have learned. To address this, we introduced a novel perspective for interpreting our sequence-based model by analyzing the consistency between attention scores and the edge weights generated by the GNN model. Interestingly, our results show that large-scale pre-trained protein language models, together with attention mechanisms, can effectively capture structural information solely from protein sequence inputs.
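The consistency analysis described above can be illustrated with a minimal sketch. The abstract does not state which consistency measure is used, so the Spearman rank correlation below is an assumption, as are the function names (`rank`, `pearson`, `attention_edge_consistency`) and the input layout: a nested list of residue-to-residue attention scores and a dictionary of GNN edge weights keyed by residue index pairs.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def attention_edge_consistency(attention, edge_weights):
    """Spearman correlation between attention scores and GNN edge weights.

    attention:    L x L nested list; attention[i][j] is the (hypothetical)
                  attention score from residue i to residue j.
    edge_weights: dict mapping an edge (i, j) to its (hypothetical) weight.
    Only residue pairs that are edges in the graph are compared.
    """
    pairs = sorted(edge_weights)
    att = [attention[i][j] for i, j in pairs]
    weights = [edge_weights[p] for p in pairs]
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(rank(att), rank(weights))


# Toy example with made-up numbers for a three-residue protein.
attention = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.7, 0.1],
]
edge_weights = {(0, 1): 0.9, (0, 2): 0.4, (1, 0): 0.8, (2, 1): 1.0}
score = attention_edge_consistency(attention, edge_weights)
```

A score near 1 would indicate that residue pairs receiving high attention from the sequence model also carry high edge weights in the structure-based graph; a score near 0 would indicate no monotonic relationship. In practice the attention matrix would first have to be reduced to one score per residue pair (for example by averaging over heads), a step this sketch leaves out.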