Cluster Analysis for Protein Sequences
Abstract
This paper presents a comprehensive comparison of MMseqs2 clustering and traditional machine learning (ML) clustering algorithms, namely KMeans and hierarchical clustering, applied to protein sequences. The analyses are validated experimentally. The cluster analyses were performed on the Astral Compendium protein sequence dataset hosted in the SCOPe database. The dataset is embedded with two pre-trained Evolutionary Scale Modeling (ESM) transformer models, and KMeans and hierarchical clustering are run on each set of embeddings. The four resulting clusterings (two embedding models, two algorithms) are then compared with the MMseqs2/Linclust and MMseqs2/easy-cluster methods. In the experiments, MMseqs2/Linclust and MMseqs2/easy-cluster outperform the traditional machine learning clustering algorithms by a considerable margin, which indicates the superiority of the MMseqs2 clustering techniques over conventional ML clustering for this task. The source code of the experiment is publicly available at https://github.com/mrzResearchArena/protein-clustering.
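The embedding-based half of the pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are synthetic stand-ins for ESM outputs, and the embedding dimension, the number of clusters, and the Ward linkage are assumptions chosen only for illustration. It uses scikit-learn's `KMeans` and `AgglomerativeClustering`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Synthetic stand-in for per-sequence ESM embeddings (assumption):
# 60 "sequences" x 32 dimensions, drawn around 3 well-separated centers.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 32))
embeddings = np.vstack([c + rng.normal(size=(20, 32)) for c in centers])

# Number of clusters is an assumption; the abstract does not state it.
n_clusters = 3

# KMeans on the embedding vectors.
kmeans_labels = KMeans(
    n_clusters=n_clusters, n_init=10, random_state=0
).fit_predict(embeddings)

# Agglomerative (hierarchical) clustering; Ward linkage is an assumption.
hier_labels = AgglomerativeClustering(
    n_clusters=n_clusters, linkage="ward"
).fit_predict(embeddings)

print("KMeans cluster sizes:      ", np.bincount(kmeans_labels))
print("Hierarchical cluster sizes:", np.bincount(hier_labels))
```

With real data, `embeddings` would be replaced by one fixed-length vector per protein sequence produced by each of the two ESM models, giving four label assignments in total to compare against the MMseqs2 clusters.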