MADRe: Strain-level metagenomic classification through assembly-driven database reduction
This article has been reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
Strain-level metagenomic classification is essential for understanding microbial diversity and functional potential, yet remains challenging, particularly when sample composition is unknown and reference databases are large and redundant. Here, we present MADRe, a modular and scalable pipeline for long-read strain-level metagenomic classification based on Metagenome Assembly-Driven Database Reduction. Beyond system-level integration, MADRe introduces statistical strategies that leverage assembly-derived genomic context to guide database reduction and probabilistic read reassignment. Specifically, it combines long-read metagenome assembly, contig-to-reference reassignment using an expectation–maximization framework for reference reduction, and probabilistic read mapping reassignment on a reduced database to achieve sensitive and precise strain-level classification. We extensively evaluated MADRe on simulated datasets, mock communities, and a real anaerobic digester sludge metagenome. Across diverse similarity and coverage conditions, MADRe consistently improves precision by reducing false-positive strain detections. MADRe’s design allows users to apply either the database reduction or read classification step individually. Using only the read classification step shows results on par with other tested tools. MADRe is open source and publicly available at https://github.com/lbcb-sci/MADRe.
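As an illustration of the expectation-maximization idea mentioned in the abstract, the following is a minimal sketch of EM-based reassignment of ambiguously mapped items (contigs or reads) to references. It is not MADRe's implementation: the function name `em_reassign`, the input layout, and the use of a single positive score per mapping as a likelihood are all assumptions made for this example.

```python
from collections import defaultdict


def em_reassign(candidates, n_iter=50, tol=1e-9):
    """Toy EM over ambiguous mappings (illustrative, not MADRe's code).

    candidates: {item_id: {ref_id: likelihood > 0}}, a hypothetical input
    where each item (contig or read) maps to one or more references.
    Returns (estimated abundance per reference, best reference per item).
    """
    refs = sorted({r for hits in candidates.values() for r in hits})
    # Start from a uniform abundance estimate over all references.
    theta = {r: 1.0 / len(refs) for r in refs}

    for _ in range(n_iter):
        # E-step: share each item among its candidate references in
        # proportion to (current abundance) x (mapping likelihood).
        totals = defaultdict(float)
        for hits in candidates.values():
            norm = sum(theta[r] * lik for r, lik in hits.items())
            for r, lik in hits.items():
                totals[r] += theta[r] * lik / norm
        # M-step: new abundances are the normalized expected counts.
        new_theta = {r: totals[r] / len(candidates) for r in refs}
        delta = max(abs(new_theta[r] - theta[r]) for r in refs)
        theta = new_theta
        if delta < tol:
            break

    # Assign each item to its most probable reference under the final estimate.
    best = {
        item: max(hits, key=lambda r: theta[r] * hits[r])
        for item, hits in candidates.items()
    }
    return theta, best


# Two items map uniquely to A; one maps equally well to A and B.
example = {"c1": {"A": 1.0}, "c2": {"A": 1.0}, "c3": {"A": 1.0, "B": 1.0}}
theta, best = em_reassign(example)
```

In this toy input the shared item is pulled toward reference A because A also has unique support, so a reference backed only by ambiguous mappings loses weight and could be dropped from a reduced database.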
Article activity feed
-
Abstract
Strain-level metagenomic classification is essential for understanding microbial diversity and functional potential, but remains challenging, particularly in the absence of prior knowledge about the composition of the sample. In this paper we present MADRe, a modular and scalable pipeline for long-read strain-level metagenomic classification, enhanced with Metagenome Assembly-Driven Database Reduction. MADRe combines long-read metagenome assembly, contig-to-reference mapping reassignment based on an expectation-maximization algorithm for database reduction, and probabilistic read mapping reassignment to achieve sensitive and precise classification. We extensively evaluated MADRe on simulated datasets, mock communities, and a real anaerobic digester sludge metagenome, demonstrating that it consistently outperforms existing tools by achieving higher precision with reduced false positives. MADRe’s design allows users to apply either the database reduction or read classification step individually. Using only the read classification step shows results on par with other tested tools. MADRe is open source and publicly available at https://github.com/lbcb-sci/MADRe.
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag030), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 2:
This manuscript presents MADRe, a modular pipeline for strain-level metagenomic classification from long-read data, emphasizing an assembly-driven database reduction strategy coupled with probabilistic reassignment. The work is methodologically sound and well aligned with the scope of GigaScience. However, the study could benefit from the following revisions:
1, The study's main contribution is engineering and integration, rather than a fundamentally new statistical model. The authors should therefore state this explicitly in both the Abstract and the Discussion.
2, Although the comparisons are reasonable, the manuscript could do more to clarify how MADRe compares against state-of-the-art strain-resolved tools under identical parameter tuning, and whether performance gains are consistent across different strain divergence levels.
3, when comparing with existing tools, improvements appear primarily in precision, while recall trade-offs are less emphasized. The authors should explicitly discuss precision-recall trade-offs and clarify in which biological scenarios MADRe is most advantageous.
4, While database reduction is presented as efficient, the computational cost of assembly plus EM iterations is not deeply analyzed. The authors should include a concise runtime/memory comparison or at least a qualitative discussion of computational trade-offs.
5, The approach implicitly assumes that metagenome assembly is sufficiently accurate and representative. However, in highly complex or low-coverage samples, assembly could be fragmented or biased. The authors should add a clearer discussion on the sensitivity to assembler choice and parameters.
-
Reviewer 1:
I have no significant concerns with the MADRe methodology, and the current datasets provide sufficient evidence of its strain-level performance. However, several issues still need to be addressed.
The response states: "However, we observed a limitation when Centrifuger cannot confidently assign a read to a specific reference sequence (for example, when multiple chromosomes belong to the same strain). In such cases, it often classifies the read under the NCBI strain-level taxid, which in some instances is identical to the species-level taxid. This makes it impossible to directly and fairly compare those classifications with other tools that operate at the sequence level."
Although I agree this issue may not substantially affect the overall conclusions, the current handling of strain-level evaluation for Centrifuger is not sufficiently rigorous. The underlying problem is that Centrifuger (and Kraken2) rely on nodes.dmp and names.dmp, where the lowest taxonomic rank is often species or subspecies. As a result, these tools cannot report strain-level abundances directly in their standard output. A more appropriate solution would be to assign custom, unique strain-level taxIDs for all reference genomes, allowing proper classification at the strain level. This approach has been discussed in https://github.com/mourisl/centrifuger/issues/18 and https://github.com/jenniferlu717/Bracken/issues/113. Additionally, Centrifuger has an extra program, centrifuger-quant, that uses the EM algorithm to estimate abundance. The read assignment results produced by Centrifuger do not apply the EM algorithm.
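The suggestion to give every reference genome its own strain-level taxID can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_strain_taxonomy`, the starting offset for custom IDs, and the genome and sequence names are hypothetical, and only the first three fields of a `nodes.dmp` row are written. Whether a particular tool accepts such shortened rows, or needs the full NCBI column set, should be checked against that tool's documentation.

```python
def make_strain_taxonomy(genomes, first_custom_taxid=1_000_000_000):
    """Assign one new, unique taxID per reference genome.

    genomes: list of (genome_name, species_taxid, [sequence_ids]).
    first_custom_taxid: assumed offset chosen to avoid clashing with
    existing NCBI taxIDs; adjust to the taxonomy actually in use.
    Returns (nodes_lines, names_lines, seqid_to_taxid).
    """
    nodes, names, seqid_to_taxid = [], [], {}
    for offset, (name, species_taxid, seq_ids) in enumerate(genomes):
        taxid = first_custom_taxid + offset
        # NCBI .dmp rows separate fields with "\t|\t" and end with "\t|".
        # Each new node hangs directly under its species node.
        nodes.append(f"{taxid}\t|\t{species_taxid}\t|\tstrain\t|")
        names.append(f"{taxid}\t|\t{name}\t|\t\t|\tscientific name\t|")
        # Chromosomes and plasmids of one genome share the same taxID,
        # so multi-sequence strains are no longer ambiguous.
        for seq_id in seq_ids:
            seqid_to_taxid[seq_id] = taxid
    return nodes, names, seqid_to_taxid


# Hypothetical genomes; 562 is the NCBI species taxID of Escherichia coli.
genomes = [
    ("E. coli strain A", 562, ["strainA_chr"]),
    ("E. coli strain B", 562, ["strainB_chr", "strainB_plasmid"]),
]
nodes, names, seqid_to_taxid = make_strain_taxonomy(genomes)
```

The generated rows would be appended to copies of `nodes.dmp` and `names.dmp`, and the sequence-to-taxID map used when building the index, so that every tool in a comparison reports at the same per-genome resolution.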
In the similarity experiment, some strains exhibit extremely high similarity, which makes proportional read distribution practically impossible for MADRe. To better characterize the performance limits of MADRe for accurate strain classification and abundance estimation, I recommend including additional simple synthetic mixtures at different combinations of similarity and coverage depth. Because long reads vary widely in length, read counts alone can be misleading. I strongly encourage reporting strain abundances rather than raw read counts, as abundances are more relevant for downstream applications. Finally, the authors should clarify whether MADRe's limitations in detecting low-abundance strains (meaning, in particular, low-coverage strains) are entirely determined by the performance of the assembly tool, or whether additional factors influence this limitation.
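A small sketch shows why read counts and abundances can diverge for long reads. It assumes one common definition of relative abundance (assigned bases divided by genome length, normalized across strains); the function name `strain_abundances` and the example numbers are hypothetical and are not taken from the manuscript.

```python
def strain_abundances(assignments, genome_lengths):
    """Coverage-based relative abundance per strain.

    assignments: iterable of (strain_id, read_length_bp) for classified reads.
    genome_lengths: {strain_id: genome_length_bp}.
    Abundance is each strain's mean coverage depth divided by the summed depth.
    """
    bases = {}
    for strain, read_len in assignments:
        bases[strain] = bases.get(strain, 0) + read_len
    # Mean depth = total assigned bases / genome length.
    depth = {s: b / genome_lengths[s] for s, b in bases.items()}
    total = sum(depth.values())
    return {s: d / total for s, d in depth.items()}


# Strain A: 2 reads of 20 kb; strain B: 4 reads of 5 kb; equal genome sizes.
reads = [("A", 20_000)] * 2 + [("B", 5_000)] * 4
abundance = strain_abundances(reads, {"A": 4_000_000, "B": 4_000_000})
```

In this example strain B receives two thirds of the reads but only one third of the sequenced bases, so a read-count view and a coverage-based abundance view rank the two strains in opposite order.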
In Figure 4, please specify the sequencing technology used for sim_high. Typo: "calculated usin fastANI" → "calculated using fastANI".
