ET-Pfam: Ensemble transfer learning for protein family prediction


Abstract

Motivation

The computational annotation of proteins remains a significant challenge in bioinformatics, as the rapid growth of sequence generation has surpassed expert curators' ability to manually review and annotate sequences. The Pfam database contains a large collection of proteins annotated with domain families through multiple sequence alignments and profile Hidden Markov models (pHMMs). However, such computational annotation methods have limitations: they struggle with large datasets, and multiple sequence alignments are computationally challenging to compute with high accuracy because complexity increases as the number and length of sequences grow. Additionally, each pHMM is obtained independently for each family, missing the opportunity to learn patterns across families, that is, from a complete view of the whole dataset. As an alternative, some deep learning (DL) models have recently been proposed, though with simple input representations and only moderate improvements in performance.

Results

In this work we present ET-Pfam, a novel approach based on transfer learning and ensembles of multiple DL classifiers to predict functional families in the Pfam database. Several base DL models are first trained on learned representations from a protein large language model, with different hyperparameters to increase diversity. The base models are then integrated using classical ensemble strategies as well as novel voting approaches that learn a weight for each model and for each Pfam family. Results show that ET-Pfam consistently reduces classification error rates compared with individual DL models and with pHMM models, boosting prediction performance. Among the strategies presented here, voting with weights learned per family achieved the best performance, with the lowest error rate (7.00%), significantly surpassing the best individual model (12.91% error) and the state-of-the-art pHMM (29.28% error) on the same Pfam dataset.
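The per-model, per-family weighted voting described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, array shapes, and the use of NumPy are assumptions for illustration only. Each base model contributes a probability distribution over families, and a learned weight matrix scales each model's vote for each family before summing:

```python
import numpy as np

def ensemble_predict(probs, weights):
    """Combine base-model outputs with per-model, per-family weights.

    probs:   array of shape (n_models, n_samples, n_families) with the
             class probabilities produced by each base classifier.
    weights: array of shape (n_models, n_families) with learned voting
             weights (hypothetical layout; one weight per model-family pair).
    Returns the index of the predicted Pfam family for each sample.
    """
    # Scale each model's score for each family by its weight,
    # then sum the weighted votes over models.
    combined = np.einsum('msf,mf->sf', probs, weights)
    return combined.argmax(axis=1)

# Toy example: 3 base models, 2 query sequences, 4 candidate families.
rng = np.random.default_rng(0)
probs = rng.random((3, 2, 4))
weights = rng.random((3, 4))
preds = ensemble_predict(probs, weights)  # one family index per sequence
```

With uniform weights this reduces to plain probability averaging; learning a separate weight per family lets the ensemble trust different base models on different families, which is the intuition behind the best-performing strategy reported above.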

Availability

Data and source code are available at https://github.com/sinc-lab/ET-Pfam.
