How suitable are clustering methods for functional annotation of proteins?

Abstract

The advent of affordable high-throughput genome sequencing has drastically expanded protein sequence databases, necessitating the development of computational tools to predict protein function from sequence data. Current methods, such as BLASTp and profile HMMs, while effective, are limited by difficulties in detecting remote homologs and uncertainties in multiple sequence alignments. To address this, we explore the use of clustering algorithms for unsupervised protein function annotation, using pseudo-amino acid composition (PAAC) as features.

In this study, we evaluated nine clustering algorithms for their ability to segregate protein sequences by function using PAAC features. Using intrinsic metrics, particularly the silhouette coefficient (SC), we determined the optimal number of clusters (k) for each algorithm. We observed that agglomerative clustering produced results resembling phylogenetic relationships. k-means clustering, the Gaussian mixture model (GMM), and spectral clustering did so as well, but occasionally merged datapoints from distinct original clusters at higher k values.
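
The idea of choosing k by the silhouette coefficient can be illustrated with a minimal sketch. This is not the authors' implementation (their notebook is in the linked repository); it is a standard-library-only illustration under stated assumptions: the toy 2D points stand in for PAAC feature vectors, Euclidean distance and average linkage are assumed, and the function names `agglomerative_labels` and `silhouette_coefficient` are hypothetical.

```python
from math import dist
from statistics import mean


def agglomerative_labels(points, k):
    """Naive average-linkage agglomerative clustering, cut at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(dist(points[i], points[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge cluster b into cluster a
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette_coefficient(points, labels):
    """Mean silhouette value s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            values.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = mean(dist(points[i], points[j]) for j in own)
        # b: mean distance to the members of the nearest other cluster
        b = min(mean(dist(points[i], points[j]) for j in members)
                for other, members in clusters.items() if other != label)
        values.append((b - a) / max(a, b))
    return mean(values)


# Toy data (an assumption): three well-separated groups of four points each.
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (10, 0), (0, 10)]
          for dx in (0, 1) for dy in (0, 1)]

# Score each candidate k and keep the one with the highest silhouette.
scores = {k: silhouette_coefficient(points, agglomerative_labels(points, k))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On this toy data the score peaks at the number of planted groups. With real PAAC vectors the peak is typically less sharp, so the silhouette curve is best read alongside other intrinsic metrics.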

Our findings reveal that k-means clustering, GMM, and agglomerative clustering effectively segregate distinct protein functional families, but their effectiveness decreases when distinguishing fine-grained functional differences. Notably, spectral clustering underperformed relative to the other methods. Affinity propagation clustering, while effective in some cases, generated more clusters than expected and was prone to false positives. Overall, we find that some of the clustering algorithms are suitable for functional annotation of protein sequences using PAAC as a feature set, even when the number of ground-truth sequences is limited.

The implementation of the clustering method for protein sequences is available in the GitHub repository (https://github.com/RakeshBusi/Clustering). It provides comprehensive steps for preprocessing, feature extraction, clustering, and evaluation. All steps are presented in a Jupyter Notebook in the repository.

Author Summary

We are in the age of big data, an outcome of the development of high-throughput techniques. Considerable resources are spent on developing and deploying such techniques. However, data by itself is not an end but a means to answer questions of relevance. Hence, developing and customising techniques that help us interpret and utilise data is equally important. In this study, we focus on customising a popular technique, namely clustering, to extract biological information from the ever-growing protein sequence database. We test the suitability of nine clustering algorithms for determining a protein’s molecular function based solely on its amino acid sequence. Based on our findings, we recommend using a combination of four algorithms: k-means, Gaussian mixture model, agglomerative clustering, and affinity propagation. However, we note that proteins with subtle functional differences cluster together, and fine-tuning the algorithms to separate such proteins requires additional experimental data.
