MKFGO: Integrating Multi-Source Knowledge Fusion with Pre-Trained Language Model for High-Accuracy Protein Function Prediction
Abstract
Accurately identifying protein functions is essential for understanding life mechanisms and advancing drug discovery. Although biochemical experiments are the gold standard for determining protein functions, they are often time-consuming and labor-intensive. Here, we propose a novel composite deep-learning method, MKFGO, which infers Gene Ontology (GO) attributes by integrating five complementary pipelines built on multi-source biological data. MKFGO was rigorously benchmarked on 1522 non-redundant proteins and outperformed 11 state-of-the-art function prediction methods. Comprehensive data analyses revealed that the major advantage of MKFGO lies in its two deep-learning components, HFRGO and PLMGO, which derive handcrafted features and protein large language model (PLM)-based features, respectively, from protein sequences in different biological views, with effective knowledge fusion at the decision level. HFRGO leverages an LSTM-attention network fed with handcrafted features, in which a triplet loss-based guilt-by-association strategy is designed to strengthen the correlation between feature similarity and function similarity. PLMGO employs the PLM to capture feature embeddings with discriminative functional patterns from sequences. The remaining three components, driven by protein-protein interaction, GO term probability, and protein-coding gene sequence, respectively, provide complementary insights that further improve prediction accuracy. The source code and models of MKFGO are freely available at https://github.com/yiheng-zhu/MKFGO.
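To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the MKFGO implementation. It shows a standard triplet loss (an anchor protein is pulled toward a functionally similar protein and pushed away from a dissimilar one) and one possible form of decision-level fusion. The function names `triplet_loss` and `fuse_scores`, the Euclidean distance, the margin value, and the weighted-average fusion rule are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    'positive' is assumed to share function with 'anchor' and 'negative'
    is assumed not to; the margin value is a hypothetical choice.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def fuse_scores(component_scores, weights):
    """Hypothetical decision-level fusion: weighted average per GO term.

    component_scores maps a component name to {GO term: confidence score}.
    A term that a component does not report counts as 0.0 for that component.
    """
    total = sum(weights[name] for name in component_scores)
    terms = set()
    for scores in component_scores.values():
        terms.update(scores)
    return {
        term: sum(
            weights[name] * scores.get(term, 0.0)
            for name, scores in component_scores.items()
        ) / total
        for term in terms
    }


if __name__ == "__main__":
    # Toy feature vectors; real features would come from a trained network.
    print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))

    # Toy per-component scores for two GO terms, with equal weights.
    scores = {
        "HFRGO": {"GO:0005515": 0.8},
        "PLMGO": {"GO:0005515": 0.6, "GO:0003677": 0.4},
    }
    print(fuse_scores(scores, {"HFRGO": 1.0, "PLMGO": 1.0}))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is what ties feature similarity to function similarity. The fusion step operates only on output scores, so each component can be trained on a different data source independently.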