Unraveling Protein Secrets: Machine Learning Unveils Novel Biologically Significant Associations Among Amino Acids

Samuel Kakraba
Aayire Clement Yadem
Kuukua Egyinba Abraham


Abstract

Hierarchical clustering of amino acids using multidimensional molecular descriptors reveals both established and novel structure-function relationships, advancing traditional classification schemes. We developed an automated clustering pipeline leveraging 22 graph-theoretic descriptors for all 20 standard amino acids, integrating parameter optimization, consensus validation, and robust statistical evaluation. Average linkage with cityblock distance achieved the highest cophenetic correlation (0.847), indicating superior preservation of pairwise relationships compared to other methods. Cluster validation metrics (silhouette: 0.573, Calinski-Harabasz: 21.45, Davies-Bouldin: 0.82) and the gap statistic consistently supported a two-cluster solution, with the dendrogram and consensus clustering revealing stable, biologically meaningful substructure. The analysis identified two dominant clusters: one comprising aromatic residues (tryptophan, phenylalanine, tyrosine) and positively charged residues (arginine, histidine, lysine), and a second encompassing aliphatic, polar, and acidic amino acids. High-stability associations (consensus > 0.85) were observed for the aromatic cluster and branched aliphatic group (isoleucine, valine, leucine), while glycine and proline emerged as pronounced outliers with low co-clustering probabilities (< 0.3), reflecting their unique structural roles. Notably, arginine showed unexpectedly high consensus with aromatic residues, suggesting a functional basis in cation–π interactions, and methionine occupied an intermediate position between hydrophobic and sulfur-containing groups. Comparative analysis demonstrated that hierarchical clustering outperformed k-means and DBSCAN in both cluster quality and biological interpretability. These findings both corroborate and refine existing amino acid classifications, highlighting the power of multidimensional descriptor-based clustering to uncover subtle biochemical relationships. The resulting hierarchy provides a robust framework for predicting mutation effects, guiding protein engineering, and informing reduced amino acid alphabets for structural modeling.
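
The two central steps in the abstract, choosing a linkage method by cophenetic correlation and scoring how stably residues co-cluster, can be illustrated with a minimal sketch. It is not the authors' pipeline. The descriptor matrix below is random placeholder data rather than the 22 graph-theoretic descriptors, the z-score standardization is an assumption, and the consensus scheme (re-clustering on random subsets of descriptors) is one common approach that the abstract does not confirm. The function names `cophenetic_scores` and `consensus_matrix` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues, one-letter codes

# Placeholder data: 20 residues x 22 descriptors. These are random numbers,
# NOT the graph-theoretic descriptors used in the preprint.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(AMINO_ACIDS), 22))
# Assumption: z-score each descriptor so no single column dominates the distance.
X = (X - X.mean(axis=0)) / X.std(axis=0)


def cophenetic_scores(X, methods=("average", "complete", "single"), metric="cityblock"):
    """Cophenetic correlation for each linkage method under one distance metric."""
    d = pdist(X, metric=metric)  # condensed pairwise distance vector
    scores = {}
    for method in methods:
        Z = linkage(d, method=method)
        c, _ = cophenet(Z, d)  # correlation between tree distances and d
        scores[method] = float(c)
    return scores


def consensus_matrix(X, n_clusters=2, n_rounds=100, frac=0.8, seed=0):
    """Fraction of rounds in which each pair of rows lands in the same cluster.

    Each round re-clusters on a random subset of descriptor columns
    (an assumed resampling scheme, chosen only for illustration).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(1, int(round(frac * p)))
    co = np.zeros((n, n))
    for _ in range(n_rounds):
        cols = rng.choice(p, size=k, replace=False)
        Z = linkage(pdist(X[:, cols], metric="cityblock"), method="average")
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        co += labels[:, None] == labels[None, :]
    return co / n_rounds


if __name__ == "__main__":
    for method, score in cophenetic_scores(X).items():
        print(f"{method:>8s} linkage: cophenetic correlation = {score:.3f}")
    C = consensus_matrix(X)
    i, j = AMINO_ACIDS.index("R"), AMINO_ACIDS.index("W")
    print(f"co-clustering frequency of R and W (placeholder data): {C[i, j]:.2f}")
```

With real descriptors in place of the placeholder matrix, entries of the consensus matrix near 1 would correspond to the high-stability associations the abstract describes, and rows with uniformly low values to outliers such as glycine and proline. On random data the numbers carry no biological meaning.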

Version published to 10.20944/preprints202505.0139.v1
May 6, 2025

Related articles

Feature-Optimized Machine Learning Benchmarking for Protein Interface Prediction in Permanent Homodimer Complexes with Distinct Structural Features

This article has 4 authors:
1. Tayyip Topuz
2. Zeki Erdem
3. Halil Bisgin
4. E. Demet Akten
This article has no evaluations
Latest version Feb 2, 2026

Integrating Evolutionary and Compositional Features with ML and DL for Robust and Interpretable Druggable Protein Prediction

This article has 5 authors:
1. Mujeebu Rehman
2. Qinghua Liu
3. Muhammad Javed
4. Ali Ghulam
5. Teerath Kumar
This article has no evaluations
Latest version Dec 11, 2025

Unlocking the genomic landscape for antimicrobial domain discovery with a two-stage progressive residue-level annotation model

This article has 13 authors:
1. Peilin Xie
2. Xingchen Liu
3. Lantian Yao
4. Zhihao Zhao
5. Anming Yang
6. Jiahui Guan
7. Zijun Jiao
8. Zhihong Liu
9. Junwen Wang
10. Tzong-Yi Lee
11. Zigang Li
12. Bingyu Cui
13. Ying-Chih Chiang
This article has no evaluations
Latest version Dec 11, 2025
