DPAC: Prediction and Design of Protein-DNA Interactions via Sequence-Based Contrastive Learning
Abstract
Interactions between DNA and proteins are pivotal in natural biological processes, and designing proteins that bind DNA with high specificity is crucial for advancing genomic technologies. Existing state-of-the-art models for both modeling and designing protein-DNA interactions rely primarily on structural information, which limits their scalability and efficiency in large-scale applications. Notable methods such as AlphaFold 3 and RoseTTAFold All-Atom exist, but they are computationally inefficient and inherently struggle to model conformationally unstable proteins, such as transcription factors, which arguably represent the most important class of DNA-binding proteins. Here, we present DPAC (DNA-Protein binding Alignment via Contrastive learning), which leverages pre-trained protein and DNA language models via a contrastive loss to align the two modalities in a high-dimensional shared latent space. DPAC not only significantly accelerates the design process compared to current structure-based methods but also demonstrates a strong ability to differentiate real binders from non-binders. Our model achieves an AUC score of 0.591 on a low-identity set, outperforming state-of-the-art structure-based methods. Additionally, DPAC integrates simulated annealing for the design of new protein sequences with optimized DNA binding affinity, successfully recovering binding affinity in engineered sequences by up to 20% in in silico tests. Our results highlight DPAC's potential for facilitating the design and discovery of sequence-specific DNA-binding proteins, paving the way for advancements in genomic research and biotechnology applications.
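To make the idea of aligning two modalities with a contrastive loss concrete, the following is a minimal sketch of one common formulation, a CLIP-style symmetric InfoNCE loss. The abstract does not specify DPAC's exact loss, so the loss form, the `temperature` value, and the function names here are all assumptions chosen for illustration. The embeddings are assumed to be fixed-length vectors already produced by the protein and DNA language models.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(protein_embs, dna_embs, temperature=0.07):
    """Hypothetical CLIP-style loss; item i in each list is assumed to be a true pair.

    All other protein/DNA combinations in the batch are treated as negatives.
    """
    p = [l2_normalize(v) for v in protein_embs]
    d = [l2_normalize(v) for v in dna_embs]
    n = len(p)
    # Temperature-scaled cosine similarity between every protein and every DNA.
    logits = [[sum(a * b for a, b in zip(pi, dj)) / temperature for dj in d]
              for pi in p]
    # Protein -> DNA direction: the matching DNA sits on the diagonal.
    loss_p2d = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # DNA -> protein direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_d2p = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_p2d + loss_d2p)


def binding_score(protein_emb, dna_emb):
    """Cosine similarity in the shared space, usable as a binder/non-binder score."""
    p, d = l2_normalize(protein_emb), l2_normalize(dna_emb)
    return sum(a * b for a, b in zip(p, d))
```

Under these assumptions, a batch whose paired embeddings point in the same direction yields a loss near zero, while mismatched pairs are penalized. A score such as `binding_score` could also serve as the objective that a simulated-annealing search over protein sequences tries to increase, although the abstract does not state which objective DPAC uses.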