How suitable are clustering methods for functional annotation of proteins?

Rakesh Busi
Pranav Machingal
Nandyala Hemachandra
Petety V. Balaji

Abstract

The advent of affordable high-throughput genome sequencing has drastically expanded protein sequence databases, necessitating computational tools that predict protein function from sequence data. Current methods, such as BLASTp and profile HMMs, are effective but limited by difficulties in detecting remote homologs and by uncertainties in multiple sequence alignments. To address these limitations, we explore the use of clustering algorithms for unsupervised protein function annotation, using pseudo-amino acid composition (PAAC) as features.
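As an illustration of the kind of feature vector PAAC produces, the following is a minimal sketch rather than the authors' implementation. It assumes a single physicochemical property (the Kyte-Doolittle hydropathy index), a small number of lag terms (`lam`), and a weight of 0.05; the abstract does not state which properties or parameters the study used, so all of these choices, and the function names `standardize` and `paac`, are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here as a single stand-in property.
# ASSUMPTION: the study may use different or additional properties.
KD_HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(scale):
    """Rescale a property table to zero mean and unit standard deviation."""
    values = list(scale.values())
    mu, sd = mean(values), pstdev(values)
    return {aa: (v - mu) / sd for aa, v in scale.items()}


def paac(sequence, lam=3, weight=0.05, scale=KD_HYDROPATHY):
    """Return a (20 + lam)-dimensional PAAC-style vector for one sequence.

    The first 20 entries are weighted amino acid frequencies; the last
    `lam` entries are weighted sequence-order correlation factors.
    """
    # Residues outside the 20 standard amino acids are silently dropped.
    seq = [aa for aa in sequence.upper() if aa in scale]
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")

    prop = standardize(scale)
    counts = Counter(seq)
    freqs = [counts[aa] / len(seq) for aa in AMINO_ACIDS]

    # theta_j: mean squared property difference between residues j apart.
    thetas = []
    for lag in range(1, lam + 1):
        diffs = [
            (prop[seq[i]] - prop[seq[i + lag]]) ** 2
            for i in range(len(seq) - lag)
        ]
        thetas.append(mean(diffs))

    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


if __name__ == "__main__":
    vec = paac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", lam=3)
    print(len(vec), round(sum(vec), 6))
```

Because every sequence maps to a vector of the same length regardless of its own length, such vectors can be passed to a clustering algorithm without a multiple sequence alignment.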

In this study, we evaluated nine clustering algorithms for their ability to segregate protein sequences by functional differences, using PAAC as the feature set. Using intrinsic metrics, particularly the silhouette coefficient (SC), we determined the optimal number of clusters (k) for each algorithm. Agglomerative clustering produced results resembling phylogenetic relationships. K-means clustering, the Gaussian mixture model (GMM), and spectral clustering did so as well, but occasionally merged data points from distinct original clusters at higher k values.
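Selecting k with the silhouette coefficient can be sketched as follows. This is an illustrative, standard-library-only example on toy 2-D points, not the study's pipeline: the k-means here uses a deterministic farthest-first initialization (a simplification, not k-means++), and the function names `kmeans`, `silhouette_coefficient`, and `choose_k` are hypothetical. In practice a library such as scikit-learn (for example its `silhouette_score` function) would normally be used on the PAAC vectors.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette_coefficient(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, k_max):
    """Return (best_k, {k: SC}) for k = 2 .. k_max."""
    scores = {
        k: silhouette_coefficient(points, kmeans(points, k))
        for k in range(2, k_max + 1)
    }
    return max(scores, key=scores.get), scores


# Toy data: three well-separated groups of four points each.
TOY_POINTS = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]

if __name__ == "__main__":
    best_k, scores = choose_k(TOY_POINTS, k_max=5)
    print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

The same loop applies to any algorithm that takes k as an input; only the call that produces the labels changes.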

Our findings reveal that k-means clustering, GMM, and agglomerative clustering effectively segregate distinct protein functional families, but their effectiveness decreases when distinguishing fine-grained functional differences. Notably, spectral clustering underperformed relative to the other methods. Affinity propagation clustering, while effective in some cases, generated more clusters than expected and was prone to false positives. Overall, we find that some of the clustering algorithms are suitable for functional annotation of protein sequences using PAAC as a feature set, even when the number of ground-truth sequences is limited.

The implementation of the clustering method for protein sequences is available in the GitHub repository linked below. It provides comprehensive steps for preprocessing, feature extraction, clustering, and evaluation. All results are presented in a Jupyter Notebook. https://github.com/RakeshBusi/Clustering

Version published to 10.1101/2024.12.26.630370v1 on bioRxiv
Dec 26, 2024

Related articles

Snekmer Learn/Apply: A kmer-based vector similarity approach to protein classification suitable for metagenomic datasets

This article has 8 authors:
1. Tara A. Nitka
2. Jeremy Jacobson
3. Christine H Chang
4. Genevieve R. Krause
5. Travis J. Wheeler
6. Robert G. Egbert
7. William C Nelson
8. Jason E McDermott
This article has no evaluations. Latest version: May 18, 2025

PMScanR: an R package for the large-scale identification, analysis, and visualization of protein motifs

This article has 5 authors:
1. Jan Pawel Jastrzebski
2. Monika Gawronska
3. Wiktor Babis
4. Miriana Quaranta
5. Damian Czopek
This article has no evaluations. Latest version: May 27, 2025

Multiresolution Clustering of Genomic Data

This article has 2 authors:
1. Ali Turfah
2. Xiaoquan Wen
This article has no evaluations. Latest version: Jun 18, 2025