CATHe2: Enhanced CATH Superfamily Detection Using ProstT5 and Structural Alphabets

Orfeú Mouret
Jad Abbass

Abstract

Motivation

The CATH database is a free, publicly available online resource that provides annotations of the evolutionary and structural relationships of protein domains. Because the influx of protein structures, driven mainly by the recent AlphaFold breakthrough, has made manual curation infeasible, the CATH team recently developed an automatic CATH superfamily classifier called CATHe, which uses a feed-forward network (FNN, i.e., a non-recurrent neural network) classifier with protein Language Model (pLM) embeddings as input. In this paper, using the same dataset, we present CATHe2, which improves on CATHe by replacing the older pLM ProtT5 with one of its most recent versions, ProstT5, and by introducing 3D information about the domain as input to the classifier in the form of a Structural Alphabet representation, namely 3Di sequence embeddings. Finally, CATHe2 implements a new version of the FNN classifier architecture, tuned for the CATH superfamily prediction task.
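To make the classifier input concrete, here is a minimal sketch of a feed-forward network that takes one amino-acid embedding and one 3Di embedding per domain. It is an illustration under stated assumptions, not the CATHe2 implementation: the abstract does not say how the two embeddings are combined, so concatenation is assumed, and the embedding size, hidden width, single hidden layer, and random weights are all hypothetical. In practice the embeddings would come from ProstT5 rather than a random generator.

```python
import numpy as np

EMB_DIM = 1024      # assumed size of one per-domain embedding vector
HIDDEN_DIM = 256    # hypothetical hidden-layer width
N_CLASSES = 1700    # roughly the number of superfamilies in the largest dataset


def init_params(rng, in_dim, hidden_dim, n_classes):
    """Randomly initialise a one-hidden-layer feed-forward network."""
    return {
        "W1": rng.normal(0.0, 0.02, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.02, size=(hidden_dim, n_classes)),
        "b2": np.zeros(n_classes),
    }


def predict_proba(params, aa_emb, di_emb):
    """Return superfamily probabilities for a batch of domains."""
    # Assumption: the AA and 3Di embeddings are joined by concatenation.
    x = np.concatenate([aa_emb, di_emb], axis=-1)
    # Hidden layer with ReLU activation; no recurrence anywhere.
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    # Numerically stable softmax over the superfamily classes.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
params = init_params(rng, 2 * EMB_DIM, HIDDEN_DIM, N_CLASSES)

# Stand-ins for ProstT5 embeddings of four domains (random, for illustration).
aa_emb = rng.normal(size=(4, EMB_DIM))
di_emb = rng.normal(size=(4, EMB_DIM))

probs = predict_proba(params, aa_emb, di_emb)
predicted = probs.argmax(axis=-1)  # index of the most likely superfamily
```

Under the same assumptions, the AA-only variant mentioned in the Results would correspond to dropping `di_emb` and halving the input dimension.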

Results

The best CATHe2 model reaches an accuracy of 92.2 ± 0.7% and an F1 score of 82.3 ± 1.3% on CATHe's largest dataset (~1700 superfamilies). Compared with the previous CATHe version (85.6 ± 0.4% accuracy and 72.4 ± 0.7% F1 score), this is an improvement of 9.9 percentage points in F1 score and 6.6 percentage points in accuracy. This model uses ProstT5 amino-acid (AA) sequence and 3Di sequence embeddings as input to the classifier, but a simplified version requiring only AA sequences already improves CATHe's F1 score by 6.7 ± 1.3 percentage points and its accuracy by 6.6 ± 0.7 percentage points on the same dataset.
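The reported gains for the best model are absolute differences between the mean scores. The short check below recomputes them from the figures quoted above. The relative gains are an extra derived quantity, not a number reported in the abstract.

```python
# Mean scores (in %) quoted in the abstract for the largest dataset.
cathe = {"accuracy": 85.6, "f1": 72.4}
cathe2 = {"accuracy": 92.2, "f1": 82.3}

# Absolute improvement in percentage points.
absolute_gain = {k: round(cathe2[k] - cathe[k], 1) for k in cathe}

# Relative improvement over the CATHe baseline, in percent (derived, not reported).
relative_gain = {k: round(100 * (cathe2[k] - cathe[k]) / cathe[k], 1) for k in cathe}

print(absolute_gain)  # {'accuracy': 6.6, 'f1': 9.9}
```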

Availability & Implementation

The code is available at https://GitHub.com/Mouret-Orfeu/CATHe2 and the datasets at https://doi.org/10.5281/zenodo.14534966

Contact

orfeu.mouret.pro@outlook.fr , j.abbass@kingston.ac.uk

Version published to 10.1101/2025.06.22.660903v1 on bioRxiv
Jun 26, 2025

