Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
The biosphere genomics era is transforming life science research, but existing methods struggle to efficiently reduce the vast dimensionality of the protein universe. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere. As a result, we detect 1.7 billion clusters of which 32% hold more than one sequence. This means that 544 million clusters represent 94% of all known proteins, illustrating that clustering across the tree of life can significantly accelerate comparative studies in the Earth BioGenome era.
Article activity feed
-
We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere
It would be super helpful here to point out where these protein sequences come from -- NCBI nr, mmseqs sets, etc.
-
we identified the ability to cluster this vast protein sequence diversity space as a key factor currently limiting the association of sequences across large sets of divergent species
Can you add a few more details on how you identified this? Why are the tools in MMseqs2 not sufficient here? What innovation is needed to overcome whatever barriers exist?
-
species
Is each of the 1.8 million genomes expected to come from a separate species, or will some be different strains of the same species?
-
Current protein clustering approaches implemented in the standard tools CD-hit13, UClust14, and Linclust15 are limited when aiming to cluster billions of proteins with such broad sequence diversity in reasonable time and with sufficient clustering sensitivity at lower identity-boundaries
What are the limitations? Why won't something like Linclust work here?
-
18 days on 27 high
What was the RAM usage?
-
18.1 million CPU hours compared to 194 million CPU hours with MMSeqs2 which makes this computation feasible today on existing HPC systems (Methods)
This doesn't feel like that big of a difference... yes, MMSeqs2 would take 10x as long, but it still feels like this could be accomplished on current compute infrastructure. If that's not true, I think it would be beneficial to highlight that.
-
these ~1.16 billion unique sequences comprise only ~6% of the full set of 19 billion sequences
Do they come from weird taxonomies too? Or metagenomes or something?
-
Finally, we designed a re-clustering procedure that allows users to add new sequences to a large collection of existing clusters so that the sequencing and assembly community can swiftly add incoming sequences to our biosphere cluster database without the need to re-cluster the entire dataset (Methods)
How does this impact cluster membership? This feels akin to the 16S debate of OTUs vs. ASVs.
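The excerpt doesn't spell out the mechanics of the update, but the usual greedy scheme is to compare each incoming sequence only against the existing centroids and to open a new cluster when nothing matches; existing centroids stay frozen, which is exactly the OTU-like behavior the comment above alludes to. A minimal sketch of that idea (the `assign_new_sequences` helper, the toy identity measure, and the threshold are all hypothetical; the paper's actual procedure uses DIAMOND alignments):

```python
# Hypothetical sketch of incremental re-clustering: each new sequence is
# compared against the existing centroids and joins the best one if it
# clears an identity threshold; otherwise it seeds a new cluster. Existing
# centroids are never revised. A real implementation would use DIAMOND
# alignments instead of this toy identity measure.

def toy_identity(a, b):
    """Fraction of matching positions over the shorter sequence (a toy
    stand-in for alignment-based identity)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def assign_new_sequences(centroids, new_seqs, threshold=0.5):
    """Map each new sequence id to a centroid id (existing or newly created).
    Mutates `centroids` in place when a new cluster is opened."""
    membership = {}
    for sid, seq in new_seqs.items():
        best_id = max(centroids, key=lambda cid: toy_identity(seq, centroids[cid]))
        if toy_identity(seq, centroids[best_id]) >= threshold:
            membership[sid] = best_id      # join an existing cluster
        else:
            centroids[sid] = seq           # open a new cluster
            membership[sid] = sid
    return membership

centroids = {"c1": "MKTAYIAKQR", "c2": "MLSPADKTNV"}
print(assign_new_sequences(centroids, {"n1": "MKTAYIAKQL", "n2": "GGGGGGGGGG"}))
# n1 joins c1; n2 matches nothing and becomes its own cluster
```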
-
Although MMSeqs2/Linclust15 presented a considerable advancement over CD-hit and UClust, it still suffers from comparatively low performance when clustering at high alignment sensitivity, thereby introducing an analytics bottleneck when attempting to scale to >27 billion estimated
But why? What is your methodological advancement that overcomes this?
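For context on the "cascaded" strategy named in the title and abstract: the general pattern is to cluster cheaply at high identity first, then re-cluster only the surviving centroids at progressively lower identity and compose the assignments, so the expensive sensitive rounds run on a much smaller input. A rough, generic sketch of that pattern (function names and the toy identity measure are illustrative, not DeepClust's internals):

```python
# Generic cascaded clustering sketch: each round clusters only the centroids
# that survived the previous round, so the expensive low-identity rounds run
# on a much smaller input. Assignments are composed across rounds.

def greedy_cluster(seqs, threshold, identity):
    """One greedy round: each sequence joins the first centroid it matches,
    longest sequences considered first."""
    centroids, assignment = {}, {}
    for sid, seq in sorted(seqs.items(), key=lambda kv: -len(kv[1])):
        for cid, cseq in centroids.items():
            if identity(seq, cseq) >= threshold:
                assignment[sid] = cid
                break
        else:
            centroids[sid] = seq    # no match: this sequence becomes a centroid
            assignment[sid] = sid
    return centroids, assignment

def cascaded_cluster(seqs, thresholds, identity):
    """Run rounds at decreasing identity thresholds, e.g. (0.9, 0.5, 0.3),
    and compose the per-round assignments into one overall mapping."""
    overall = {sid: sid for sid in seqs}
    level = dict(seqs)
    for t in thresholds:
        level, assignment = greedy_cluster(level, t, identity)
        overall = {sid: assignment[cid] for sid, cid in overall.items()}
    return overall

def toy_identity(a, b):
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

seqs = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQL", "c": "MLSPADKTNV"}
print(cascaded_cluster(seqs, (0.9, 0.5), toy_identity))  # {'a': 'a', 'b': 'a', 'c': 'c'}
```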
-
19 billion sequences
What are the database sources for these sequences? Are they only eukaryotic, or do they include bacteria and archaea too?
-
can be compressed into 335 million centroids for downstream analyses (Supplementary fig. 10)
It might be good to highlight that this is on the same order as the current size of NCBI nr (or at least I think it is), meaning our algorithms can already handle searches at this scale
-
30% sequence identity
Why was this sequence identity selected?
-
clusterable homologs found
Where is the gold-standard set of homologs defined?
-
Experimental Study
But this doesn't exist yet, right? I think it would be good to clarify that here.
-
optimality of cluster assignment
How is this calculated?
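The manuscript presumably defines this metric precisely, but one natural reading is: a member's assignment is "optimal" when its assigned centroid is also its best-scoring centroid under exhaustive comparison, and the reported value is the fraction of members for which that holds. A sketch of that reading (this formulation is an assumption, not taken from the paper):

```python
# Assumed reading of "optimality of cluster assignment": the fraction of
# members whose assigned centroid is also their highest-identity centroid
# under exhaustive comparison. This formulation is a guess, not the paper's.

def optimality(members, centroids, assignment, identity):
    """Fraction of members assigned to their best-scoring centroid."""
    optimal = 0
    for sid, seq in members.items():
        best = max(centroids, key=lambda cid: identity(seq, centroids[cid]))
        if assignment[sid] == best:
            optimal += 1
    return optimal / len(members)

def toy_identity(a, b):
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

centroids = {"c1": "MKTAYIAKQR", "c2": "MLSPADKTNV"}
members = {"m1": "MKTAYIAKQL", "m2": "MLSPADKSNV"}
# m2 is deliberately misassigned to c1, so half the assignments are optimal
print(optimality(members, centroids, {"m1": "c1", "m2": "c1"}, toy_identity))  # 0.5
```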
-
Fig. 1
Is it possible to increase the font size in this figure? It is very difficult to read.
-
In the first round, we subsample the seed space using minimizers with a window size of 12 (ref. 19), which we empirically found to provide a good balance between speed and sensitivity, and attempt to achieve linear computational scaling of comparisons by considering only seed hits against the longest sequence for identical seeds rather than trialing all possible combinations15
Did you consider using UKHS or something like it here? https://kingsfordlab.cbd.cmu.edu/publication/orenstein-2016-compactkmers/
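For readers unfamiliar with the quoted step: a (w, k)-minimizer scheme keeps, for every window of w consecutive k-mers, only the smallest one under some ordering, so neighboring windows often share their chosen seed and the seed space shrinks by roughly a factor of w. A small sketch of the standard scheme (the lexicographic ordering and the k value are illustrative; real tools typically order k-mers by a hash, and DeepClust's exact scheme is not shown in this excerpt):

```python
# Standard (w, k)-minimizer subsampling: for each window of w consecutive
# k-mers, keep only the smallest k-mer (lexicographic here; real tools
# usually order by a hash). Adjacent windows frequently pick the same k-mer,
# so the retained seed set is far smaller than the full k-mer set.

def minimizers(seq, k=8, w=12):
    """Return the set of (position, k-mer) minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])  # smallest k-mer wins
        picked.add((start + best, window[best]))
    return picked

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
seeds = minimizers(seq)
print(f"{len(seeds)} minimizer seeds out of {len(seq) - 8 + 1} total 8-mers")
```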
-
Fig. 2
Can you increase the font size for panels C and D and the y axis on panels A and B?
-
We hard-masked this database using tantan21 with default settings and removed all sequences that were masked over >10% of their range, resulting in a reduced database of 445,610,930 sequences.
What types of sequences did this remove? What biological biases are introduced here relative to using the full nr?
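To make the filtering step concrete: tantan in hard-mask mode replaces low-complexity residues with X, so the removal criterion reduces to a per-sequence masked-fraction cutoff. A minimal sketch assuming X-masked input (the 10% cutoff is the one quoted above; the helper names are hypothetical):

```python
# Sketch of the quoted filter: after hard-masking, low-complexity residues
# are 'X', so the removal criterion is a per-sequence masked-fraction cutoff.

def masked_fraction(seq):
    """Fraction of residues that were hard-masked to 'X'."""
    return seq.count("X") / len(seq) if seq else 1.0

def filter_masked(seqs, max_masked=0.10):
    """Keep only sequences masked over at most max_masked of their range."""
    return {sid: s for sid, s in seqs.items() if masked_fraction(s) <= max_masked}

seqs = {
    "kept":    "MKTAYIAKQRQISFVKSHFSRQ",  # no masked residues
    "dropped": "MKXXXXXXXXXXXXXXXXXXQR",  # ~82% masked, well over 10%
}
print(list(filter_masked(seqs)))  # ['kept']
```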
-
Supplementary fig. 2
Can you make the fonts larger in these figures? This figure is sort of confusing with its separation into panels. I think it would be better to have a single figure with the matching color scheme used for earlier figures in the manuscript (one color per tool), and then draw a darker line at the 0.05 error rate, or put an asterisk on the y axis for tools that stay below it. Same feedback for Figs. S3-S7.
-
Supplementary fig. 1
Can you make the fonts larger in these panels?
-
We established the ground truth for these evaluations by computing a full Smith-Waterman alignment of the evaluated centroid or cluster member sequences against all centroid sequences using DIAMOND in --swipe mode which guarantees perfect pairwise alignment sensitivity.
Do you think this produces a gold-standard ground truth, or will there still be error here? I think highlighting possible sources of error here could help the reader understand the limitations of the evaluation.
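For concreteness on why this serves as a ground truth: Smith-Waterman is the exact local-alignment dynamic program, so unlike heuristic seed-and-extend it cannot miss the optimal local alignment for any pair, though it still inherits whatever error lives in the scoring model itself. A compact textbook implementation (DIAMOND's --swipe mode uses a substitution matrix such as BLOSUM62 and affine gap penalties; the linear gaps and +2/-1 scores below are simplifications):

```python
# Textbook Smith-Waterman: exact local alignment score by dynamic programming.
# Every cell of the DP matrix is evaluated, so the optimal local alignment is
# found by construction, which is what makes it usable as a ground truth.
# Real evaluations use a substitution matrix (e.g. BLOSUM62) and affine gaps;
# the linear gap and +2/-1 scores here are simplifications.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal local alignment score of a vs b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                  # local alignment never goes negative
                          prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman("MKTAYIAKQR", "TAYIAK"))  # 12: a perfect 6-residue match
```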
-
22,788,215,153
Can you estimate what fraction of these are overlapping and 100% redundant?
-
Data availability
I don't see this section in the preprint; are the databases available now?