GNN2Pfam: Integrating protein sequence and structure with graph neural networks for Pfam domain annotation
Abstract
The challenge of establishing the relationship between protein sequences and their function is not yet fully solved. State-of-the-art annotation of Pfam domains is based on hidden Markov models (HMMs) built from hand-crafted sequence alignments. Although this approach has been highly successful in the decades since it was proposed, a very large number of proteins remain unannotated, either because they cannot be aligned to known, functionally characterized sequences or because the HMMs fail to discriminate between similar domains. Adding structural information using deep neural networks and graph neural networks (GNNs) presents an opportunity to build upon existing models in these more challenging cases. GNNs excel at capturing complex relationships in data and can learn a model that shares information across all existing families, which allows Pfam domain predictions to generalize to novel sequences. In this work we propose GNN2Pfam, an end-to-end GNN-based method for Pfam family domain annotation. Our strategy allows a single model to be trained for all species and families. This novel proposal combines the protein 3D structure with a sequence representation obtained from a large pre-trained model. The GNN2Pfam model operates on a graph derived from amino acid interactions in the 3D structure and learns both sequential and structural features from this representation. Experiments show that the proposed GNN-based model clearly outperforms the state-of-the-art HMM approach in Pfam domain annotation. These results suggest that GNN models can be a key component of future protein annotation tools. Data and source code are available at https://github.com/efenoy/GNN2Pfam.
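To make the idea of a graph derived from amino acid interactions more concrete, the following is a minimal sketch and not the GNN2Pfam implementation. It assumes one 3D coordinate per residue (for example the C-alpha atom), a hypothetical 8 Å distance cutoff for defining an interaction, and short toy feature vectors standing in for the embeddings of a pre-trained sequence model. The function names `contact_graph` and `mean_message_pass` are illustrative only, and the mean aggregation is a generic message-passing step chosen for simplicity.

```python
import math


def contact_graph(coords, cutoff=8.0):
    """Link residues whose coordinates lie within `cutoff` (assumed value, in angstroms)."""
    n = len(coords)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def mean_message_pass(features, neighbors):
    """One round of mean aggregation: each residue averages itself with its neighbors."""
    updated = []
    for i, nbrs in enumerate(neighbors):
        group = [features[i]] + [features[j] for j in nbrs]
        dim = len(features[i])
        updated.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return updated


# Toy input: four residues, the last one far from the others.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
# Toy per-residue features standing in for pre-trained sequence embeddings.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]

neighbors = contact_graph(coords)
updated = mean_message_pass(features, neighbors)
```

In this sketch each residue's updated vector mixes its own sequence-derived features with those of its spatial neighbors, which is one simple way a model can combine sequential and structural information. An isolated residue keeps its original features because it has no neighbors to average with.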