ET-Pfam: Ensemble transfer learning for protein family prediction


Abstract

Motivation

The computational annotation of proteins remains a significant challenge in bioinformatics, as the rapid growth of sequence generation has surpassed expert curators' ability to manually review and annotate sequences. The Pfam database contains a large collection of proteins annotated with domain families through multiple sequence alignments and profile Hidden Markov models (pHMMs). However, such computational annotation methods have limitations: they struggle with large datasets, and multiple sequence alignments are computationally challenging to compute with high accuracy because complexity increases as the number and length of sequences grow. Additionally, each pHMM is obtained independently for each family, missing the opportunity to learn patterns across families, that is, from a complete view of the whole dataset. As an alternative, some deep learning (DL) models have recently been proposed, though with simple input representations and only moderate improvements in performance.

Results

In this work we present ET-Pfam, a novel approach based on transfer learning and ensembles of multiple DL classifiers to predict functional families in the Pfam database. Several base DL models are first trained on learned representations from a protein large language model, with different hyperparameters to increase diversity. The base models are then integrated using classical ensemble strategies as well as novel voting approaches that learn a weight for each model and for each Pfam family. Results show that ET-Pfam consistently reduces classification error rates compared with individual DL models and with pHMM models, boosting prediction performance. Among the strategies presented here, voting with weights learned per family achieved the best performance, with the lowest error rate (7.00%), significantly surpassing the best individual model (12.91% error) and the state-of-the-art pHMM (29.28% error) on the same Pfam dataset.
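The per-model, per-family weighted voting described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, array shapes, and the use of NumPy are assumptions for illustration only. Each base model contributes a probability distribution over families, and a learned weight matrix scales each model's vote for each family before summing:

```python
import numpy as np

def ensemble_predict(probs, weights):
    """Combine base-model outputs with per-model, per-family weights.

    probs:   array of shape (n_models, n_samples, n_families) with the
             class probabilities produced by each base classifier.
    weights: array of shape (n_models, n_families) with learned voting
             weights (hypothetical layout; one weight per model-family pair).
    Returns the index of the predicted Pfam family for each sample.
    """
    # Scale each model's score for each family by its weight,
    # then sum the weighted votes over models.
    combined = np.einsum('msf,mf->sf', probs, weights)
    return combined.argmax(axis=1)

# Toy example: 3 base models, 2 query sequences, 4 candidate families.
rng = np.random.default_rng(0)
probs = rng.random((3, 2, 4))
weights = rng.random((3, 4))
preds = ensemble_predict(probs, weights)  # one family index per sequence
```

With uniform weights this reduces to plain probability averaging; learning a separate weight per family lets the ensemble trust different base models on different families, which is the intuition behind the best-performing strategy reported above.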

Availability

Data and source code are available at https://github.com/sinc-lab/ET-Pfam.
