MKFGO: Integrating Multi-Source Knowledge Fusion with Pre-Trained Language Model for High-Accuracy Protein Function Prediction

Yi-Heng Zhu
Shuxin Zhu
Xuan Yu
He Yan
Yan Liu
Xiaojun Xie
Dong-Jun Yu
Rui Ye


Abstract

Accurately identifying protein functions is essential for understanding life mechanisms and advancing drug discovery. Although biochemical experiments are the gold standard for determining protein functions, they are often time-consuming and labor-intensive. Here, we propose a novel composite deep-learning method, MKFGO, to infer Gene Ontology (GO) attributes by integrating five complementary pipelines built on multi-source biological data. MKFGO was rigorously benchmarked on 1522 non-redundant proteins, demonstrating superior performance over 11 state-of-the-art function prediction methods. Comprehensive data analyses revealed that the major advantage of MKFGO lies in its two deep-learning components, HFRGO and PLMGO, which derive handcrafted features and protein large language model (PLM)-based features, respectively, from protein sequences from different biological views, with effective knowledge fusion at the decision level. HFRGO leverages an LSTM-attention network fed with handcrafted features, in which a triplet loss-based guilt-by-association strategy is designed to enhance the correlation between feature similarity and function similarity. PLMGO employs the PLM to capture feature embeddings with discriminative functional patterns from sequences. Meanwhile, the other three components provide complementary insights that further improve prediction accuracy, driven by protein-protein interaction, GO term probability, and protein-coding gene sequence, respectively. The source code and models of MKFGO are freely available at https://github.com/yiheng-zhu/MKFGO.
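
The two ideas named in the abstract, a triplet loss and decision-level fusion, can be illustrated with a minimal Python sketch. It is not taken from the MKFGO code: the standard triplet margin loss stands in for the paper's variant, the weighted-average fusion rule is an assumption (the abstract does not state how component scores are combined), and the function names, weights, and scores are hypothetical toy values.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the anchor is closer to the positive (a protein
    with similar function) than to the negative by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def fuse_scores(component_scores, weights):
    """Assumed fusion rule: weighted average of per-GO-term scores.

    A component that does not score a term contributes 0 for that term.
    """
    total = sum(weights.values())
    terms = set()
    for scores in component_scores.values():
        terms.update(scores)
    return {
        term: sum(weights[name] * scores.get(term, 0.0)
                  for name, scores in component_scores.items()) / total
        for term in terms
    }


# Toy feature vectors (hypothetical values).
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
loss_easy = triplet_loss(anchor, positive, negative)  # negative already far away
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped, large loss

# Toy per-term scores from two of the five components (hypothetical values).
scores = {
    "HFRGO": {"GO:0003674": 0.9, "GO:0005515": 0.6},
    "PLMGO": {"GO:0003674": 0.7, "GO:0005515": 0.2},
}
weights = {"HFRGO": 0.5, "PLMGO": 0.5}
fused = fuse_scores(scores, weights)
```

In this sketch, each component only needs to emit a score per GO term, so additional pipelines (for example, ones driven by protein-protein interaction data) could be added to `scores` and `weights` without changing the fusion function.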

Version published to 10.1101/2025.03.27.645685v1 on bioRxiv
Apr 1, 2025

Related articles

Extending Prot2Token: Aligning Protein Language Models for Unified and Diverse Protein Prediction Tasks

This article has 7 authors:
1. Mahdi Pourmirzaei
2. Ye Han
3. Farzaneh Esmaili
4. Mohammadreza Pourmirzaei
5. Salhuldin Alqarghuli
6. Kai Chen
7. Dong Xu
This article has no evaluations
Latest version Mar 11, 2025

ProtDAT: A Unified Multimodal Cross-Attention Framework for Ab-Initio Amino Acid Sequence Design from Any Protein Text Description

This article has 5 authors:
1. Hongbin Shen
2. Xiaoyu Guo
3. Yi-Fan Li
4. Yuan Liu
5. Xiaoyong Pan
This article has no evaluations
Latest version Mar 10, 2025

SEHI-PPI: An End-to-End Sampling-Enhanced Human-Influenza Protein-Protein Interaction Prediction Framework with Double-View Learning

This article has 8 authors:
1. Qiang Yang
2. Xiao Fan
3. Haiqing Zhao
4. Zhe Ma
5. Megan Stanifer
6. Jiang Bian
7. Marco Salemi
8. Rui Yin
This article has no evaluations
Latest version Mar 12, 2025
