Interpretable Biological Sequence Clustering with iClust

Simeng Zhang
Xinying Liu
Jun Lou
Mudi Jiang
Zengyou He


Abstract

Biological sequence clustering is a fundamental problem in bioinformatics, yet most existing methods optimize clustering quality or efficiency while offering little insight into why sequences are grouped together. This limits their usefulness in downstream analysis, where representative sequences and clear cluster boundaries are often needed. To address this issue, we propose iClust, an interpretable clustering method that characterizes each cluster by a representative prototype and an adaptive radius. By adapting to local sequence structure rather than relying on a single global threshold, iClust produces clusters that are both meaningful and explainable. A final consolidation step reduces the number of tiny fragment clusters and improves structural stability. Experiments on simulated and real biological sequence datasets show that iClust achieves competitive clustering performance while providing clearer cluster-level explanations than conventional threshold-based methods. Beyond its practical value as a clustering method for biological sequences, this work opens new avenues for developing biological sequence clustering approaches from the viewpoint of interpretable machine learning.
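To make the prototype-and-radius idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a normalized edit distance as the sequence dissimilarity and uses hand-picked radii; the abstract does not say how the adaptive radii are estimated or how the consolidation step works, so neither is reproduced here. The names `Cluster`, `normalized_distance`, and `explain_assignment` are hypothetical.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer sequence length (an assumption)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


@dataclass
class Cluster:
    prototype: str  # representative sequence of the cluster
    radius: float   # per-cluster radius; hand-picked here, not learned


def explain_assignment(seq: str, clusters: list[Cluster]):
    """Assign seq to the nearest prototype whose radius covers it.

    Returns (cluster index or None, human-readable explanation).
    """
    best = None
    for idx, cluster in enumerate(clusters):
        d = normalized_distance(seq, cluster.prototype)
        if d <= cluster.radius and (best is None or d < best[1]):
            best = (idx, d)
    if best is None:
        return None, "outside every cluster radius"
    idx, d = best
    cluster = clusters[idx]
    return idx, (
        f"distance {d:.3f} to prototype {cluster.prototype} "
        f"is within radius {cluster.radius:.3f}"
    )


# Two toy clusters with different radii, standing in for adaptive radii.
clusters = [Cluster("ACGTACGT", 0.25), Cluster("TTTTGGGG", 0.40)]
index, reason = explain_assignment("ACGTACGA", clusters)
print(index, reason)
```

In this sketch the explanation for a sequence is just the pair (prototype, radius) of its cluster plus the measured distance, which is the kind of cluster-level account a single global threshold cannot give when clusters differ in tightness.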

Version published to 10.64898/2026.04.13.718335 on bioRxiv
Apr 16, 2026

Related articles

RapCluster: Bridging the Reproducibility Gap in Clustering Analysis

This article has 4 authors:
1. Ahmad Lutfi
2. Robert Warneke
3. Lutz Fischer
4. Juri Rappsilber
This article has no evaluations. Latest version: Apr 15, 2026

Partner determination from protein sequences using class information with CLAPP

This article has 5 authors:
1. Lisa Gennai
2. Francesco Caredda
3. Mathieu E. Rebeaud
4. Andrea Pagnani
5. Paolo De Los Rios
This article has no evaluations. Latest version: May 11, 2026

Metagenomic-scale analysis of the predicted protein structure universe

This article has 11 authors:
1. Martin Steinegger
2. Jingi Yeo
3. Yewon Han
4. Nicola Bordin
5. Andy Lau
6. Shaun Kandathil
7. Hyunbin Kim
8. Eli Levy Karin
9. Milot Mirdita
10. David Jones
11. Christine Orengo
This article has no evaluations. Latest version: Mar 31, 2026
