Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust


Abstract

The biosphere genomics era is transforming life science research, but existing methods struggle to efficiently reduce the vast dimensionality of the protein universe. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere. As a result, we detect 1.7 billion clusters of which 32% hold more than one sequence. This means that 544 million clusters represent 94% of all known proteins, illustrating that clustering across the tree of life can significantly accelerate comparative studies in the Earth BioGenome era.

Article activity feed

  1. We present DIAMOND DeepClust, an ultra-fast cascaded clustering method optimized to cluster the 19 billion protein sequences currently defining the protein biosphere

    It would be super helpful here to point out where these protein sequences come from -- NCBI nr, mmseqs sets, etc.

  2. we identified the ability to cluster this vast protein sequence diversity space as a key factor currently limiting the association of sequences across large sets of divergent species

    Can you add a few more details on how you identified this? why are tools in mmseqs2 not sufficient here? what innovation is needed to overcome whatever barriers exist?

  3. Current protein clustering approaches implemented in the standard tools CD-hit [13], UClust [14], and Linclust [15] are limited when aiming to cluster billions of proteins with such broad sequence diversity in reasonable time and with sufficient clustering sensitivity at lower identity boundaries

    What are the limitations? why won't something like linclust work here?

  4. 18.1 million CPU hours compared to 194 million CPU hours with MMSeqs2, which makes this computation feasible today on existing HPC systems (Methods)

    This doesn't feel like that big of a difference...yes, mmseqs2 would take 10x as long, but still feels like it could be accomplished on current compute infrastructure. If that's not true, I think it would be beneficial to highlight that.

  5. these ~1.16 billion unique sequences comprise only ~6% of the full set of 19 billion sequences

    Do they come from weird taxonomies too? or metagenomes or something?

  6. Finally, we designed a re-clustering procedure that allows users to add new sequences to a large collection of existing clusters so that the sequencing and assembly community can swiftly add incoming sequences to our biosphere cluster database without the need to re-cluster the entire dataset (Methods)

    How does this impact cluster membership? this feels akin to the 16s debate of OTUs vs. ASVs
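
The order-dependence worry behind this comment is easy to see in a toy sketch of the incremental idea: assign each incoming sequence to an existing centroid if it matches well enough, otherwise let it found a new cluster. Everything below (the function name, the `identity` callback, the 0.7 cutoff) is a hypothetical illustration, not the authors' implementation:

```python
def assign_incremental(new_seqs, centroids, identity, threshold=0.7):
    """Greedy incremental assignment: each new sequence joins the
    best-matching existing cluster whose centroid it matches at or
    above `threshold`; otherwise it founds a new singleton cluster.
    `identity(a, b)` is a placeholder for a pairwise alignment
    identity in [0, 1]."""
    clusters = {c: [] for c in centroids}
    for seq in new_seqs:
        best, best_score = None, threshold
        for centroid in list(clusters):
            score = identity(seq, centroid)
            if score >= best_score:
                best, best_score = centroid, score
        if best is not None:
            clusters[best].append(seq)
        else:
            # No centroid is close enough: seq becomes a new centroid,
            # so the result depends on input order.
            clusters[seq] = []
    return clusters
```

Because membership depends on arrival order and on which sequences happened to become centroids, incremental assignment can drift from what a full re-clustering would produce — which is exactly the OTU-vs-ASV-style concern raised here.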

  7. Although MMSeqs2/Linclust [15] presented a considerable advancement over CD-hit and UClust, it still suffers from comparatively low performance when clustering at high alignment sensitivity, thereby introducing an analytics bottleneck when attempting to scale to >27 billion estimated

    but why? what is your methodological advancement that overcomes this?

  8. can be compressed into 335 million centroids for downstream analyses (Supplementary fig. 10)

    It might be good to highlight that this is on the same order as the current size of NCBI nr (or at least I think it is), meaning our algorithms can already handle searches at this scale

  9. In the first round, we subsample the seed space using minimizers with a window size of 12 [19], which we empirically found to provide a good balance between speed and sensitivity, and attempt to achieve linear computational scaling of comparisons by considering only seed hits against the longest sequence for identical seeds rather than trialing all possible combinations [15]

    Did you consider using UKHS or something like it here? https://kingsfordlab.cbd.cmu.edu/publication/orenstein-2016-compactkmers/
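
For readers unfamiliar with minimizer subsampling, a minimal sketch of the idea the quote refers to: in each window of w consecutive k-mers, keep only the k-mer with the smallest hash, so similar sequences share seeds without enumerating every k-mer. The parameters and the CRC32 hash below are illustrative stand-ins, not DIAMOND's internals:

```python
import zlib

def minimizers(seq, k=4, w=12):
    """Subsample the seed space: in every window of w consecutive
    k-mers, keep only the k-mer with the smallest deterministic
    hash. Identical windows pick identical k-mers, so two similar
    sequences still share seeds after subsampling."""
    kmers = [(zlib.crc32(seq[i:i + k].encode()), i)
             for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(max(0, len(kmers) - w + 1)):
        _, i = min(kmers[start:start + w])
        picked.add((i, seq[i:i + k]))
    return picked
```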

  10. We hard-masked this database using tantan [21] with default settings and removed all sequences that were masked over >10% of their range, resulting in a reduced database of 445,610,930 sequences.

    What types of sequences did this remove? what biological biases are introduced here over using full nr?
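
The cutoff in the quote is simple to state concretely. A rough sketch of the filtering step, assuming hard-masked residues appear as 'X' and records arrive as (header, sequence) pairs (the function name and record format are made up for illustration):

```python
def filter_masked(records, max_masked_frac=0.10):
    """Keep only sequences whose masked fraction is at or below the
    cutoff. Assumes hard-masked residues are written as 'X' (or a
    lowercase 'x'); `records` is an iterable of (header, sequence)
    pairs, a stand-in for parsed FASTA."""
    kept = []
    for header, seq in records:
        masked = seq.count("X") + seq.count("x")
        if seq and masked / len(seq) <= max_masked_frac:
            kept.append((header, seq))
    return kept
```

Since tantan targets low-complexity and tandem repeats, a filter like this preferentially drops repeat-rich proteins — which is the kind of bias this comment is asking the authors to characterize.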

  11. Supplementary fig. 2

    Can you make the fonts larger in these figures? This figure is sort of confusing with its separation into panels. I think it would be better to have a single figure with the matching color scheme used for earlier figures in the manuscript (one color per tool), and then draw a darker line for 0.05 error rate, or put an asterisk on the y axis for tools that maintain below that. Same feedback for Fig S3-S7

  12. We established the ground truth for these evaluations by computing a full Smith-Waterman alignment of the evaluated centroid or cluster member sequences against all centroid sequences using DIAMOND in --swipe mode, which guarantees perfect pairwise alignment sensitivity.

    Do you think this produces a gold-standard ground truth, or will there still be error here? I think highlighting possible sources of error could help the reader understand the limitations of the evaluation.
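
Context for this comment: full Smith-Waterman is "perfect" only in the sense that it fills every cell of the dynamic-programming matrix, so it cannot miss the optimal local alignment under the chosen scoring scheme; the scheme itself (substitution matrix, gap penalties) remains a modeling choice and thus a residual error source. A score-only sketch with illustrative scores, not a biological substitution matrix:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Score-only Smith-Waterman local alignment by full dynamic
    programming. Every cell is evaluated, so the optimal local
    alignment under this scoring scheme cannot be missed -- the
    property that makes an exhaustive mode usable as a sensitivity
    reference for heuristic, seed-based search."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```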
