A rapid phylogeny-based method for accurate community profiling of large-scale metabarcoding datasets

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This potentially important work presents a tool for performing phylogenetic taxonomic classification of DNA sequences. In terms of methodology, the work is compelling. The authors perform a benchmark experiment against current state-of-the-art tools using real and simulated datasets to demonstrate where the novel tool stands in the context of existing methods. However, the experimentation is still incomplete. It would benefit from a more thorough exploration of existing methods as well as data sets that better represent real-world use cases.

This article has been Reviewed by the following groups

Abstract

Environmental DNA (eDNA) is becoming an increasingly important tool in diverse scientific fields from ecological biomonitoring to wastewater surveillance of viruses. The fundamental challenge in eDNA analyses has been the bioinformatical assignment of reads to taxonomic groups. It has long been known that full probabilistic methods for phylogenetic assignment are preferable, but unfortunately, such methods are computationally intensive and are typically inapplicable to modern next-generation sequencing data. We present a fast approximate likelihood method for phylogenetic assignment of DNA sequences. Applying the new method to several mock communities and simulated datasets, we show that it identifies more reads at both high and low taxonomic levels more accurately than other leading methods. The advantage of the method is particularly apparent in the presence of polymorphisms and/or sequencing errors and when the true species is not represented in the reference database.
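
To make the idea of likelihood-based assignment concrete, here is a minimal toy sketch. It is not Tronko's algorithm: it assumes the read is already aligned to each candidate node, that each node stores a per-site probability for each base, and that sequencing errors occur independently at a fixed rate. The function names, the `error_rate` parameter, and the example sequences are all hypothetical.

```python
import math

BASES = "ACGT"


def certain_profile(seq):
    """Build a per-site profile that puts probability 1 on each base of seq."""
    return [{b: (1.0 if b == c else 0.0) for b in BASES} for c in seq]


def read_log_likelihood(read, profile, error_rate=0.01):
    """Log-likelihood of an aligned read given per-site base probabilities.

    profile is a list of dicts mapping each base to its probability at that
    site. error_rate is an assumed per-base sequencing error probability.
    """
    total = 0.0
    for base, site in zip(read, profile):
        if base not in BASES:
            # This toy version simply skips gaps and ambiguity codes.
            continue
        # The observed base is either the true base read correctly, or one of
        # the three other bases misread with probability error_rate / 3.
        p = site[base] * (1.0 - error_rate) + (1.0 - site[base]) * error_rate / 3.0
        total += math.log(p)
    return total


def best_node(read, node_profiles, error_rate=0.01):
    """Return the highest-scoring node name and the score of every node."""
    scores = {
        name: read_log_likelihood(read, profile, error_rate)
        for name, profile in node_profiles.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Made-up four-base references, for illustration only.
    profiles = {
        "species_A": certain_profile("ACGT"),
        "species_B": certain_profile("ACGA"),
    }
    print(best_node("ACGT", profiles))
```

A real method would also need to align reads, handle gaps explicitly, and decide how far up the tree to place a read when several nodes score similarly; none of that is attempted here.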

Article activity feed

  1. Author response:

    Reviewer #2 (Public Review):

    This is, to my knowledge, the most scalable method for phylogenetic placement that uses likelihoods. The tool has an interesting and innovative means of using gaps, which I haven’t seen before. In the validation the authors demonstrate superior performance to existing tools for taxonomic annotation (though there are questions about the setup of the validation as described below).

    The program is written in C with no library dependencies. This is great. However, I wasn’t able to try out the software because the linking failed on Debian 11, and the binary artifact made by the GitHub Actions pipeline was too recent for my GLIBC/kernel. It’d be nice to provide a binary for people stuck on older kernels (our cluster is still on Ubuntu 18.04). Also, would it be hard to publish your .zipped binaries as packages?

    We have provided a binary (and zipped package) that supports Ubuntu 18.04 via GitHub Actions (https://github.com/lpipes/tronko/actions/runs/9947708087). This should make our software easier to use on older systems like yours. However, we were not able to test the binary, since GitHub did not seem to find any nodes running Ubuntu 18.04. Note that Ubuntu 18.04 is deprecated. The latest version of Ubuntu is 24.04, and we recommend that users upgrade to a newer, supported version of their operating system to benefit from the latest security updates and features.

    Thank you for publishing your source files for the validation on zenodo. Please provide a script that would enable the user to rerun the analysis using those files, either on zenodo or on GitHub somewhere.

    We have posted all datasets as well as scripts to Zenodo.

    The validations need further attention as follows.

    First, the authors have chosen data sets that are not well aligned with real-world use cases for this software, and as a result, its applicability is difficult to determine. The leave-one-species-out experiment made use of COI gene sequences representing 253 species from the order Charadriiformes, which includes bird species such as gulls and terns. What is the reasoning for selecting this data set given the objective of demonstrating the utility of Tronko for large scale community profiling experiments which by their nature tend to include microorganisms as subjects? If the authors are interested in evaluating COI (or another gene target) as a marker for characterizing the composition of eukaryotic populations, is the heterogeneity and species distribution of bird species within order Charadriiformes comparable to what one would expect in populations of organisms that might actually be the target of a metagenomic analysis?

    Our reasoning for selecting Charadriiformes is that these species are often mistaken for one another and there is a heavy reliance on COI for their species identification. This choice allows us to demonstrate Tronko’s ability to handle difficult and realistic identification challenges. Additionally, we aimed to simulate a challenging dataset to effectively differentiate between the methods used, showcasing Tronko’s robustness. Including more distantly related bird species would have simplified the identification process, which would not serve our objective of demonstrating the utility of Tronko for distinguishing closely related species. It is also important to note that all methods used the exact same reference database, which is not always the case in other comparative studies of species assignment.

    Furthermore, while our study uses bird species, the principles and techniques applied are broadly applicable to other taxa, including microorganisms. By selecting a dataset known for its identification difficulties, we underscore Tronko’s potential utility in a wide range of taxonomic profiling scenarios, including those involving high heterogeneity and closely related species, such as in microbial communities.

    Second, it appears that experiments evaluating performance for 16S were limited to reclassification of sequencing data from mock communities described in two publications, Schirmer (2015, 49 bacteria and 10 archaea, all environmental), and Gohl (2016; 20 bacteria - this is the widely used commercial mock community from BEI, all well-known human pathogens or commensals). The authors performed a comparison with kraken2, metaphlan2, and MEGAN using both the default database for each as well as the same database used for Tronko (kudos for including the latter). This pair of experiments provides a reasonable high-level indication of Tronko’s performance relative to other tools, but the total number of organisms is very limited, and particularly limited with respect to the human microbiome. It is also important to point out that these mock communities are composed primarily of type strains and provide limited species-level heterogeneity. The performance of these classification tools on type strains may not be representative of what one would find in natural samples. Thus, the leave-one-individual-out and leave-one-species-out experiments would have been more useful and informative had they been applied to extended 16S data sets representing more ecologically realistic populations.

    We thank the reviewer for this comment. We have included both an additional bacterial mock community dataset from Lluch et al. (2015) and an additional leave-one-species-out experiment. We describe how this leave-one-species-out dataset was constructed in our previous response to ‘Essential Revisions’ #1. We also added Figures 5, S5, and S6.

    Finally, the authors should describe the composition of the databases used for classification as well as the strategy (and toolchain) used to select reference sequences. What databases were the reference sequences drawn from and by what criteria? Were the reference databases designed to reflect the composition of the mock communities (and if so, are they limited to species in those communities, or are additional related species included), or have the authors constructed general purpose reference databases? How many representatives of each species were included (on average), and were there efforts to represent a diversity of strains for each species? The methods should include a section detailing the construction of the data sets: as illustrated in this very study, the choice of reference database influences the quality of classification results, and the authors should explain the process and design considerations for database construction.

    To construct our databases, we used CRUX (Curd et al., 2018). This is described in the Methods section under ‘Custom 16S and COI Tronko-build reference database construction’. All leave-out tests were run on downsampled versions of these two databases. It is beyond the scope of the manuscript to discuss how CRUX works. Additionally, we added the following text:

    To compare the new method (Tronko) to previous methods, we constructed reference databases for COI and 16S for common amplicon primer sets using CRUX (See Methods for exact primers used).
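
The leave-one-species-out design discussed in this response can be sketched roughly as follows. This is an illustration of the general idea only, under the assumption that each reference record carries a species label and that all records of the held-out species become the queries. The function name, the record layout, and the toy sequences are hypothetical and are not taken from the authors' scripts.

```python
from collections import defaultdict


def leave_one_species_out(references):
    """Yield (held_out_species, queries, reduced_db) for every species.

    references is a list of (sequence_id, species, sequence) tuples. For each
    species, all of its records become queries and the remaining records form
    the reduced reference database.
    """
    by_species = defaultdict(list)
    for record in references:
        by_species[record[1]].append(record)

    # Sort species names so the splits come out in a reproducible order.
    for species in sorted(by_species):
        queries = by_species[species]
        reduced_db = [rec for rec in references if rec[1] != species]
        yield species, queries, reduced_db


if __name__ == "__main__":
    # Toy records: real species names, made-up sequences.
    refs = [
        ("seq1", "Larus argentatus", "ACGTAC"),
        ("seq2", "Larus argentatus", "ACGTAT"),
        ("seq3", "Sterna hirundo", "TTGACA"),
    ]
    for species, queries, db in leave_one_species_out(refs):
        print(species, len(queries), len(db))
```

Because the held-out species is absent from the reduced database, a correct species-level call is impossible by construction. Such an experiment therefore tests whether a method backs off to the right genus or family rather than forcing a wrong species.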

  2. eLife assessment

    This potentially important work presents a tool for performing phylogenetic taxonomic classification of DNA sequences. In terms of methodology, the work is compelling. The authors perform a benchmark experiment against current state-of-the-art tools using real and simulated datasets to demonstrate where the novel tool stands in the context of existing methods. However, the experimentation is still incomplete. It would benefit from a more thorough exploration of existing methods as well as data sets that better represent real-world use cases.

  3. Reviewer #1 (Public Review):

    In this manuscript, the authors present Tronko, a novel tool for performing phylogenetic assignment of DNA sequences using an approximate likelihood approach. Through a benchmark experiment utilizing several real datasets from mock communities with pre-known composition as well as simulated datasets, the authors show that Tronko is able to achieve higher accuracy than several existing best-practice methods with runtime comparable to the fastest existing method, albeit with significantly higher peak memory usage than existing methods. The benchmark experiment was thorough, and the results clearly support the authors' conclusions. However, the paper could be improved by exploring how certain design choices (e.g. tool selection and parameter choices) may impact Tronko's performance/accuracy, and some relevant existing phylogenetic placement tools are missing and should be included.

  4. Reviewer #2 (Public Review):

    This is, to my knowledge, the most scalable method for phylogenetic placement that uses likelihoods. The tool has an interesting and innovative means of using gaps, which I haven't seen before. In the validation the authors demonstrate superior performance to existing tools for taxonomic annotation (though there are questions about the setup of the validation as described below).

    The program is written in C with no library dependencies. This is great. However, I wasn't able to try out the software because the linking failed on Debian 11, and the binary artifact made by the GitHub Actions pipeline was too recent for my GLIBC/kernel. It'd be nice to provide a binary for people stuck on older kernels (our cluster is still on Ubuntu 18.04). Also, would it be hard to publish your .zipped binaries as packages?

    Thank you for publishing your source files for the validation on zenodo. Please provide a script that would enable the user to rerun the analysis using those files, either on zenodo or on GitHub somewhere.

    The validations need further attention as follows.

    First, the authors have chosen data sets that are not well aligned with real-world use cases for this software, and as a result, its applicability is difficult to determine. The leave-one-species-out experiment made use of COI gene sequences representing 253 species from the order Charadriiformes, which includes bird species such as gulls and terns. What is the reasoning for selecting this data set given the objective of demonstrating the utility of Tronko for large scale community profiling experiments which by their nature tend to include microorganisms as subjects? If the authors are interested in evaluating COI (or another gene target) as a marker for characterizing the composition of eukaryotic populations, is the heterogeneity and species distribution of bird species within order Charadriiformes comparable to what one would expect in populations of organisms that might actually be the target of a metagenomic analysis?

    Second, it appears that experiments evaluating performance for 16S were limited to reclassification of sequencing data from mock communities described in two publications, Schirmer (2015, 49 bacteria and 10 archaea, all environmental), and Gohl (2016; 20 bacteria - this is the widely used commercial mock community from BEI, all well-known human pathogens or commensals). The authors performed a comparison with kraken2, metaphlan2, and MEGAN using both the default database for each as well as the same database used for Tronko (kudos for including the latter). This pair of experiments provides a reasonable high-level indication of Tronko's performance relative to other tools, but the total number of organisms is very limited, and particularly limited with respect to the human microbiome. It is also important to point out that these mock communities are composed primarily of type strains and provide limited species-level heterogeneity. The performance of these classification tools on type strains may not be representative of what one would find in natural samples. Thus, the leave-one-individual-out and leave-one-species-out experiments would have been more useful and informative had they been applied to extended 16S data sets representing more ecologically realistic populations.

    Finally, the authors should describe the composition of the databases used for classification as well as the strategy (and toolchain) used to select reference sequences. What databases were the reference sequences drawn from and by what criteria? Were the reference databases designed to reflect the composition of the mock communities (and if so, are they limited to species in those communities, or are additional related species included), or have the authors constructed general purpose reference databases? How many representatives of each species were included (on average), and were there efforts to represent a diversity of strains for each species? The methods should include a section detailing the construction of the data sets: as illustrated in this very study, the choice of reference database influences the quality of classification results, and the authors should explain the process and design considerations for database construction.

  5. Reviewer #3 (Public Review):

    Pipes and Nielsen propose a valuable new computational method for assigning individual Next Generation Sequencing (NGS) reads to their taxonomic group of origin, based on comparison with a dataset of reference metabarcode sequences (i.e. using an existing known marker sequence such as COI or 16S). The underlying problem is an important one, with broad applications such as identifying species of origin of smuggled goods, identifying the composition of metagenomics/microbiomics samples, or detecting the presence of pathogen variants of concern from wastewater surveillance samples. Pipes and Nielsen propose (and make available with open source software) new computational methods, apply those methods to a series of exemplar data analyses mirroring plausible real-life scenarios, and compare the new method's performance to that of various field-leading alternative methods.

    In terms of methodology, the manuscript presents novel computational analyses inspired by standard existing probabilistic phylogenetic models for the evolution of genome sequences. These form the basis for comparisons of each NGS read with a reference database of known examples spanning the taxonomic range of interest. The evolutionary aspects of the models are used (a) to statistically represent knowledge about the reference organisms (and uncertainty about their common ancestors) and their evolutionary relationships; and (b) to derive inferences about the relationship of the sample NGS reads that may be derived from reference organisms or from related organisms not represented in the reference dataset. This general approach has been considered previously and, while expected to be powerful in principle, the reliance of those methods on likelihood computations over a phylogenetic tree structure means they are slow to the point of being useless on modern-sized problems that may have many thousands of reference sequences and many millions of NGS reads. Alternative methods that have been devised to be computationally feasible have had to sacrifice the phylogenetic approach, with a consequent loss of statistical power.

    Pipes and Nielsen's methodology contribution in this manuscript is to make a series of approximations to the 'ideal' phylogenetic likelihood analysis, aimed at saving computational time and keeping computer memory requirements acceptable whilst retaining as much as possible of the expected power of phylogenetic methods. Their description of their novel methods is solid; as they are largely approximations to other existing methods, their value ultimately will rest with the success of the method in application.

    Regarding the application of the new methods, to compare the accuracy of their method with a selection of existing methods the authors use 1) simulated datasets and 2) previously published mock community datasets to query sequencing reads against appropriate reference trees. The authors show that Tronko has a higher success at assigning query reads (at the species/genus/family level) than the existing tools with both datasets. In terms of computational performance, the authors show Tronko outperforms another phylogenetic tool, and is still within reasonable limits when compared with other 'lightweight' tools.

    As a demonstration of the power of phylogeny-based methods for taxonomic assignment, this ms. could gain added importance by refocusing the community towards explicitly phylogenetic methods. We agree with the authors that this would be likely to give rise to the most powerful possible methods.

    Strengths of this ms. are 1) the focus on phylogenetic approaches; 2) the reduction of a consequently difficult computational problem to a practical method (with freely available software); 3) the reminder that these approaches work well and are worthy of continued interest and development; and ultimately, most importantly, 4) the creation of a powerful tool for taxonomic assignment that seems to be at least as good as any other and generally better.

    Weaknesses of the manuscript at present are 1) lack of consideration of some other existing methods and approaches, as it would be interesting to know if other ideas had been tried and rejected, or were not compatible with the methods created; 2) some over-simplifications in the description of new methods, with some aspects difficult or impossible to reproduce and some claims unsubstantiated. Further, 3) we are not convinced enough weight has been given to the complexity of 'pre-processing' the reference dataset for each metabarcode (e.g. gene) of interest, which may give the impression that the method is easier to apply to new reference datasets than we think would be the case. Lastly, 4) we encountered some difficulties getting the software installed and running on our computers. It was not possible to resolve every issue in the time available to us to perform our review, and some processing options remain untested.

    Overall, the methods that Pipes and Nielsen propose represent an important contribution that both creates a computational resource that is immediately valuable to the community, and emphasises the benefits of phylogenetic methods and provides encouragement for others to continue to work in this area to create still-better methods.
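
The species-, genus-, and family-level success rates that Reviewer #3 refers to can be computed along the following lines. This is a minimal sketch under assumed conventions, not the authors' evaluation code: each read has a known true taxonomy, a tool reports a name (or nothing) at each rank, and a read counts as correct at a rank only if the reported name matches the truth. The function name, the data layout, and the placeholder taxon names are hypothetical.

```python
RANKS = ("species", "genus", "family")


def rank_accuracy(assignments, truth):
    """Return the fraction of reads assigned correctly at each rank.

    truth maps read_id -> {rank: true_name}. assignments maps
    read_id -> {rank: assigned_name or None}. Reads missing from
    assignments, or unassigned at a rank, count as incorrect at that rank.
    """
    result = {}
    for rank in RANKS:
        correct = 0
        for read_id, true_taxonomy in truth.items():
            assigned = assignments.get(read_id, {}).get(rank)
            if assigned is not None and assigned == true_taxonomy[rank]:
                correct += 1
        result[rank] = correct / len(truth) if truth else 0.0
    return result


if __name__ == "__main__":
    # Placeholder taxon names, for illustration only.
    truth = {
        "read1": {"species": "sp_a", "genus": "gen_x", "family": "fam_1"},
        "read2": {"species": "sp_b", "genus": "gen_x", "family": "fam_1"},
    }
    assignments = {
        "read1": {"species": "sp_a", "genus": "gen_x", "family": "fam_1"},
        # A conservative call: no species, but the correct genus and family.
        "read2": {"species": None, "genus": "gen_x", "family": "fam_1"},
    }
    print(rank_accuracy(assignments, truth))
```

Counting unassigned reads as incorrect is one possible convention. Benchmarks often report assigned-but-wrong and unassigned reads separately, since the two kinds of failure carry different costs.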