PIPENN-EMB: ensemble net and protein embeddings generalise protein interface prediction beyond homology
Abstract
Protein interactions are crucial for understanding biological functions and disease mechanisms, but predicting them remains a complex task in computational biology. Deep learning models are increasingly successful at interface prediction. This study presents PIPENN-EMB, which explores the added value of embeddings from the ProtT5-XL protein language model. Our results show a substantial improvement over the previously published PIPENN model for protein interaction interface prediction, reaching an MCC of 0.313 vs. 0.249 and an AUC-ROC of 0.800 vs. 0.755 on the BIO_DL_TE test set. Ablation studies furthermore show that these embeddings cover a broad range of ‘hand-crafted’ protein features. PIPENN-EMB reaches state-of-the-art performance on the ZK448 dataset for protein-protein interface prediction. We showcase predictions on 25 resistance-related proteins from Mycobacterium tuberculosis. Furthermore, whereas other state-of-the-art sequence-based methods perform worse for proteins that have little recognisable homology in their training data, PIPENN-EMB generalises to remote homologs, yielding stable AUC-ROC across all three test sets for proteins with less than 30% sequence identity to the training dataset, and even for proteins with less than 15% sequence identity.
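For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how MCC and AUC-ROC can be computed for per-residue interface predictions. It is not the authors' evaluation code: the labels, scores, and the 0.5 decision threshold are hypothetical toy values chosen only for illustration, and the function names are assumptions.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = interface residue)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom


def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This is quadratic in the number of residues,
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 residues, 3 of them true interface residues.
labels = [1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8, 0.05]

# Assumed threshold of 0.5 to turn scores into binary predictions for MCC.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"MCC     = {mcc(labels, preds):.3f}")
print(f"AUC-ROC = {auc_roc(labels, scores):.3f}")
```

On this toy input the MCC is 7/15 (about 0.467) and the AUC-ROC is 14/15 (about 0.933). MCC depends on the chosen threshold, whereas AUC-ROC is computed from the ranking of the raw scores, which is one reason the two are often reported together.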
Availability
Webserver, source code and datasets at www.ibi.vu.nl/programs/pipennemb/