Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
The biosphere genomics era is transforming life science research, but existing methods struggle to efficiently reduce the vast dimensionality of the protein universe. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere. As a result, we detect 1.7 billion clusters of which 32% hold more than one sequence. This means that 544 million clusters represent 94% of all known proteins, illustrating that clustering across the tree of life can significantly accelerate comparative studies in the Earth BioGenome era.
Article activity feed
-
We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere
It would be super helpful here to point out where these protein sequences come from -- NCBI nr, mmseqs sets, etc.
-
we identified the ability to cluster this vast protein sequence diversity space as a key factor currently limiting the association of sequences across large sets of divergent species
Can you add a few more details on how you identified this? Why are the tools in MMseqs2 not sufficient here? What innovation is needed to overcome whatever barriers exist?
-
species
Is each of the 1.8 million genomes expected to come from a separate species, or will some be different strains of the same species?
-
Current protein clustering approaches implemented in the standard tools CD-hit13, UClust14, and Linclust15 are limited when aiming to cluster billions of proteins with such broad sequence diversity in reasonable time and with sufficient clustering sensitivity at lower identity-boundaries
What are the limitations? Why won't something like Linclust work here?
-
18 days on 27 high
What was the RAM usage?
-
18.1 million CPU hours compared to 194 million CPU hours with MMSeqs2 which makes this computation feasible today on existing HPC systems (Methods)
This doesn't feel like that big of a difference... yes, MMSeqs2 would take 10x as long, but it still feels like this could be accomplished on current compute infrastructure. If that's not true, I think it would be beneficial to highlight that.
-
these ~1.16 billion unique sequences comprise only ~6% of the full set of 19 billion sequences
Do they come from weird taxonomies too? Or metagenomes or something?
-
Finally, we designed a re-clustering procedure that allows users to add new sequences to a large collection of existing clusters so that the sequencing and assembly community can swiftly add incoming sequences to our biosphere cluster database without the need to re-cluster the entire dataset (Methods)
How does this impact cluster membership? This feels akin to the 16S debate of OTUs vs. ASVs.
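The excerpt doesn't spell out the mechanics of the update, but the usual greedy scheme is to compare each incoming sequence only against the existing centroids and to open a new cluster when nothing matches; existing centroids stay frozen, which is exactly the OTU-like behavior the comment above alludes to. A minimal sketch of that idea (the `assign_new_sequences` helper, the toy identity measure, and the threshold are all hypothetical; the paper's actual procedure uses DIAMOND alignments):

```python
# Hypothetical sketch of incremental re-clustering: each new sequence is
# compared against the existing centroids and joins the best one if it
# clears an identity threshold; otherwise it seeds a new cluster. Existing
# centroids are never revised. A real implementation would use DIAMOND
# alignments instead of this toy identity measure.

def toy_identity(a, b):
    """Fraction of matching positions over the shorter sequence (a toy
    stand-in for alignment-based identity)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def assign_new_sequences(centroids, new_seqs, threshold=0.5):
    """Map each new sequence id to a centroid id (existing or newly created).
    Mutates `centroids` in place when a new cluster is opened."""
    membership = {}
    for sid, seq in new_seqs.items():
        best_id = max(centroids, key=lambda cid: toy_identity(seq, centroids[cid]))
        if toy_identity(seq, centroids[best_id]) >= threshold:
            membership[sid] = best_id      # join an existing cluster
        else:
            centroids[sid] = seq           # open a new cluster
            membership[sid] = sid
    return membership

centroids = {"c1": "MKTAYIAKQR", "c2": "MLSPADKTNV"}
print(assign_new_sequences(centroids, {"n1": "MKTAYIAKQL", "n2": "GGGGGGGGGG"}))
# n1 joins c1; n2 matches nothing and becomes its own cluster
```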
-
Although MMSeqs2/Linclust15 presented a considerable advancement over CD-hit and UClust, it still suffers from comparatively low performance when clustering at high alignment sensitivity, thereby introducing an analytics bottleneck when attempting to scale to >27 billion estimated
But why? What is your methodological advancement that overcomes this?
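For context on the "cascaded" strategy named in the title and abstract: the general pattern is to cluster cheaply at high identity first, then re-cluster only the surviving centroids at progressively lower identity and compose the assignments, so the expensive sensitive rounds run on a much smaller input. A rough, generic sketch of that pattern (function names and the toy identity measure are illustrative, not DeepClust's internals):

```python
# Generic cascaded clustering sketch: each round clusters only the centroids
# that survived the previous round, so the expensive low-identity rounds run
# on a much smaller input. Assignments are composed across rounds.

def greedy_cluster(seqs, threshold, identity):
    """One greedy round: each sequence joins the first centroid it matches,
    longest sequences considered first."""
    centroids, assignment = {}, {}
    for sid, seq in sorted(seqs.items(), key=lambda kv: -len(kv[1])):
        for cid, cseq in centroids.items():
            if identity(seq, cseq) >= threshold:
                assignment[sid] = cid
                break
        else:
            centroids[sid] = seq    # no match: this sequence becomes a centroid
            assignment[sid] = sid
    return centroids, assignment

def cascaded_cluster(seqs, thresholds, identity):
    """Run rounds at decreasing identity thresholds, e.g. (0.9, 0.5, 0.3),
    and compose the per-round assignments into one overall mapping."""
    overall = {sid: sid for sid in seqs}
    level = dict(seqs)
    for t in thresholds:
        level, assignment = greedy_cluster(level, t, identity)
        overall = {sid: assignment[cid] for sid, cid in overall.items()}
    return overall

def toy_identity(a, b):
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

seqs = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQL", "c": "MLSPADKTNV"}
print(cascaded_cluster(seqs, (0.9, 0.5), toy_identity))  # {'a': 'a', 'b': 'a', 'c': 'c'}
```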
-
19 billion sequences
What are the database sources for these sequences? Are they only eukaryotic, or do they include bacteria and archaea too?
-
can be compressed into 335 million centroids for downstream analyses (Supplementary fig. 10)
It might be good to highlight that this is on the same order as the current size of NCBI nr (or at least I think it is), meaning our algorithms can already handle searches at this scale
-
30% sequence identity
Why was this sequence identity selected?
-
clusterable homologs found
Where is the gold-standard set of homologs defined?
-
Experimental Study
But this doesn't exist yet, right? I think it would be good to clarify that here.
-
optimality of cluster assignment
How is this calculated?
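The manuscript presumably defines this metric precisely, but one natural reading is: a member's assignment is "optimal" when its assigned centroid is also its best-scoring centroid under exhaustive comparison, and the reported value is the fraction of members for which that holds. A sketch of that reading (this formulation is an assumption, not taken from the paper):

```python
# Assumed reading of "optimality of cluster assignment": the fraction of
# members whose assigned centroid is also their highest-identity centroid
# under exhaustive comparison. This formulation is a guess, not the paper's.

def optimality(members, centroids, assignment, identity):
    """Fraction of members assigned to their best-scoring centroid."""
    optimal = 0
    for sid, seq in members.items():
        best = max(centroids, key=lambda cid: identity(seq, centroids[cid]))
        if assignment[sid] == best:
            optimal += 1
    return optimal / len(members)

def toy_identity(a, b):
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

centroids = {"c1": "MKTAYIAKQR", "c2": "MLSPADKTNV"}
members = {"m1": "MKTAYIAKQL", "m2": "MLSPADKSNV"}
# m2 is deliberately misassigned to c1, so half the assignments are optimal
print(optimality(members, centroids, {"m1": "c1", "m2": "c1"}, toy_identity))  # 0.5
```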
-
Fig. 1
Is it possible to increase the font size in this figure? It is very difficult to read.
-
In the first round, we subsample the seed space using minimizers with a window size of 12 (ref. 19), which we empirically found to provide a good balance between speed and sensitivity, and attempt to achieve linear computational scaling of comparisons by considering only seed hits against the longest sequence for identical seeds rather than trialing all possible combinations15
Did you consider using UKHS or something like it here? https://kingsfordlab.cbd.cmu.edu/publication/orenstein-2016-compactkmers/
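For readers unfamiliar with the quoted step: a (w, k)-minimizer scheme keeps, for every window of w consecutive k-mers, only the smallest one under some ordering, so neighboring windows often share their chosen seed and the seed space shrinks by roughly a factor of w. A small sketch of the standard scheme (the lexicographic ordering and the k value are illustrative; real tools typically order k-mers by a hash, and DeepClust's exact scheme is not shown in this excerpt):

```python
# Standard (w, k)-minimizer subsampling: for each window of w consecutive
# k-mers, keep only the smallest k-mer (lexicographic here; real tools
# usually order by a hash). Adjacent windows frequently pick the same k-mer,
# so the retained seed set is far smaller than the full k-mer set.

def minimizers(seq, k=8, w=12):
    """Return the set of (position, k-mer) minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])  # smallest k-mer wins
        picked.add((start + best, window[best]))
    return picked

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
seeds = minimizers(seq)
print(f"{len(seeds)} minimizer seeds out of {len(seq) - 8 + 1} total 8-mers")
```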
-
Fig. 2
Can you increase the font size for panels C and D and the y axis on panels A and B?
-
We hard-masked this database using tantan21 with default settings and removed all sequences that were masked over >10% of their range, resulting in a reduced database of 445,610,930 sequences.
What types of sequences did this remove? What biological biases are introduced here relative to using the full nr?
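To make the filtering step concrete: tantan in hard-mask mode replaces low-complexity residues with X, so the removal criterion reduces to a per-sequence masked-fraction cutoff. A minimal sketch assuming X-masked input (the 10% cutoff is the one quoted above; the helper names are hypothetical):

```python
# Sketch of the quoted filter: after hard-masking, low-complexity residues
# are 'X', so the removal criterion is a per-sequence masked-fraction cutoff.

def masked_fraction(seq):
    """Fraction of residues that were hard-masked to 'X'."""
    return seq.count("X") / len(seq) if seq else 1.0

def filter_masked(seqs, max_masked=0.10):
    """Keep only sequences masked over at most max_masked of their range."""
    return {sid: s for sid, s in seqs.items() if masked_fraction(s) <= max_masked}

seqs = {
    "kept":    "MKTAYIAKQRQISFVKSHFSRQ",  # no masked residues
    "dropped": "MKXXXXXXXXXXXXXXXXXXQR",  # ~82% masked, well over 10%
}
print(list(filter_masked(seqs)))  # ['kept']
```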
-
Supplementary fig. 2
Can you make the fonts larger in these figures? This figure is sort of confusing with its separation into panels. I think it would be better to have a single figure with the matching color scheme used for earlier figures in the manuscript (one color per tool), and then draw a darker line at the 0.05 error rate, or put an asterisk on the y axis for tools that stay below it. Same feedback for Figs. S3-S7.
-
Supplementary fig. 1
Can you make the fonts larger in these panels?
-
We established the ground truth for these evaluations by computing a full Smith-Waterman alignment of the evaluated centroid or cluster member sequences against all centroid sequences using DIAMOND in --swipe mode which guarantees perfect pairwise alignment sensitivity.
Do you think this produces a gold-standard ground truth, or will there still be error here? I think highlighting possible sources of error here could help the reader understand the limitations of the evaluation.
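For concreteness on why this serves as a ground truth: Smith-Waterman is the exact local-alignment dynamic program, so unlike heuristic seed-and-extend it cannot miss the optimal local alignment for any pair, though it still inherits whatever error lives in the scoring model itself. A compact textbook implementation (DIAMOND's --swipe mode uses a substitution matrix such as BLOSUM62 and affine gap penalties; the linear gaps and +2/-1 scores below are simplifications):

```python
# Textbook Smith-Waterman: exact local alignment score by dynamic programming.
# Every cell of the DP matrix is evaluated, so the optimal local alignment is
# found by construction, which is what makes it usable as a ground truth.
# Real evaluations use a substitution matrix (e.g. BLOSUM62) and affine gaps;
# the linear gap and +2/-1 scores here are simplifications.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal local alignment score of a vs b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                  # local alignment never goes negative
                          prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman("MKTAYIAKQR", "TAYIAK"))  # 12: a perfect 6-residue match
```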
-
22,788,215,153
Can you estimate what fraction of these are overlapping and 100% redundant?
-
Data availability
I don't see this section in the preprint; are the databases available now?