GNN2Pfam: Integrating protein sequence and structure with graph neural networks for Pfam domain annotation

Emilio Fenoy
Leandro A. Bugnon
Rosario Vitale
Sofia A. Duarte
Diego H. Milone
Georgina Stegmayer

Abstract

The relationship between protein sequence and function is not yet fully resolved. State-of-the-art annotation of Pfam domains relies on hidden Markov models (HMMs) built from hand-crafted sequence alignments. Although this approach has been highly successful over the decades since it was proposed, a very large number of proteins remain unannotated, either because they cannot be aligned to known, functionally characterized sequences or because the HMMs fail to discriminate between similar domains. Adding structural information through deep learning and graph neural networks (GNNs) offers an opportunity to build upon existing models in these more challenging cases. GNNs excel at capturing complex relationships in data and can learn a model that shares information across all existing families, which allows Pfam domain predictions to generalize to novel sequences. In this work we propose GNN2Pfam, an end-to-end GNN-based method for Pfam domain annotation. Our strategy allows a single model to be trained for all species and families. The method combines the protein 3D structure with a sequence representation obtained from a large pre-trained model. GNN2Pfam operates on a graph derived from amino acid interactions in the 3D structure and learns both sequential and structural features from this representation. Experiments show that the proposed GNN-based model clearly outperforms the state-of-the-art HMM approach in Pfam domain annotation. These results suggest that GNN models can be the key component of future protein annotation tools. Data and source code are available at https://github.com/efenoy/GNN2Pfam .
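
As an illustration of the idea of a graph derived from amino acid interactions in the 3D structure, the following minimal sketch builds a residue contact graph and runs one round of neighbor averaging over per-residue features. It is not the GNN2Pfam implementation: the 8 Å cutoff, the use of a single coordinate per residue, the function names `contact_graph` and `mean_aggregate`, and the toy two-dimensional features (standing in for embeddings from a pre-trained sequence model) are all assumptions made for this example.

```python
import math


def contact_graph(coords, cutoff=8.0):
    """Return undirected edges (i, j), i < j, for residues within `cutoff` angstroms.

    `coords` holds one (x, y, z) point per residue; the cutoff is an assumed value.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


def mean_aggregate(features, edges):
    """One message-passing step: replace each node's features with the
    mean over itself and its graph neighbors (no learned weights)."""
    n = len(features)
    neighbors = [[i] for i in range(n)]  # each node includes itself
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dim = len(features[0])
    return [
        [sum(features[k][d] for k in neighbors[i]) / len(neighbors[i])
         for d in range(dim)]
        for i in range(n)
    ]


# Toy input: four residues on a straight line, 3.8 angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
# Hypothetical per-residue features standing in for sequence embeddings.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

edges = contact_graph(coords)
smoothed = mean_aggregate(features, edges)
```

In this toy chain, residues 0 and 3 lie 11.4 Å apart and are therefore not connected, while every other pair is. A trained GNN would replace the plain averaging with learned transformations, but the flow of information along structural contacts follows the same pattern.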

Version published to 10.1101/2025.09.18.677074 on bioRxiv
Sep 21, 2025