Enhanced prediction of protein functional identity through the integration of sequence and structural features
Abstract
Although over 300 million protein sequences are registered in a reference sequence database, only 0.2% have experimentally determined functions. This suggests that many valuable proteins, potentially catalyzing novel enzymatic reactions, remain undiscovered among the vast number of proteins of unknown function. In this study, we developed a method to predict whether two proteins catalyze the same enzymatic reaction by analyzing their sequence and structural similarities, using structural models predicted by AlphaFold2. We performed pocket detection and domain decomposition on each structural model. The similarity between the proteins in each pair was then assessed using features such as full-length sequence similarity, domain structural similarity, and pocket similarity. We built several models using conventional machine learning algorithms and found that the LightGBM-based model outperformed the others. Our method also surpassed existing approaches, including those based solely on full-length sequence similarity and state-of-the-art deep learning models. Feature importance analysis revealed that domain sequence identity, calculated from structural alignment, had the greatest influence on the prediction. These findings demonstrate that integrating sequence and structural information improves the accuracy of protein function prediction.
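To make the pairwise feature idea concrete, the following is a minimal sketch of how per-pair features of the kind described above might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the function names `aligned_identity` and `pair_features`, the feature keys, the use of `-` as the gap character, and the choice to divide identical positions by the number of gap-free aligned columns are all hypothetical. The alignments are assumed to come from external tools (for the domain feature, a structural aligner), and the pocket similarity score is taken as a precomputed input.

```python
from typing import Dict, Tuple


def aligned_identity(aln_a: str, aln_b: str) -> float:
    """Fraction of gap-free aligned columns whose residues are identical.

    Assumption: both strings are rows of one pairwise alignment, with '-'
    marking gaps. The denominator (gap-free columns) is one of several
    common conventions and may differ from the one used in the article.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if a == b:
            identical += 1
    return identical / aligned if aligned else 0.0


def pair_features(
    full_aln: Tuple[str, str],
    domain_aln: Tuple[str, str],
    pocket_similarity: float,
) -> Dict[str, float]:
    """Collect hypothetical features for one protein pair.

    full_aln: full-length sequence alignment of the two proteins.
    domain_aln: residue correspondence from a structural alignment of
        matched domains, written as two gapped strings.
    pocket_similarity: precomputed pocket similarity score (assumed input).
    """
    return {
        "full_length_identity": aligned_identity(*full_aln),
        "domain_identity": aligned_identity(*domain_aln),
        "pocket_similarity": pocket_similarity,
    }


if __name__ == "__main__":
    features = pair_features(("ACDE", "ACDF"), ("AC-D", "ACGD"), 0.5)
    print(features)
```

A feature dictionary like this, computed for many labeled protein pairs, could then be passed to a gradient-boosted classifier such as LightGBM to predict whether the two proteins catalyze the same reaction.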