Metagenomics-Toolkit: The Flexible and Efficient Cloud-Based Metagenomics Workflow featuring Machine Learning-Enabled Resource Allocation

Abstract

The metagenome analysis of complex environments with thousands of datasets, such as those available in the Sequence Read Archive, requires immense computational resources to finish within an acceptable time frame. Such large-scale analyses require that the underlying infrastructure be used efficiently. In addition, any analysis should be fully reproducible, and the workflow must be publicly available so that other researchers can understand the reasoning behind computed results. Here, we introduce the Metagenomics-Toolkit, a scalable, data-agnostic workflow that automates the analysis of short and long metagenomic reads obtained from Illumina and Oxford Nanopore Technologies devices, respectively. The Metagenomics-Toolkit offers not only the standard features expected in a metagenome workflow, such as quality control, assembly, binning, and annotation, but also distinctive features, such as plasmid identification based on various tools, the recovery of unassembled microbial community members, and the discovery of microbial interdependencies through a combination of dereplication, co-occurrence, and genome-scale metabolic modeling. Furthermore, the Metagenomics-Toolkit includes a machine learning-optimized assembly step that tailors the peak RAM value requested by a metagenome assembler to match actual requirements, thereby minimizing the dependency on dedicated high-memory hardware. While the Metagenomics-Toolkit can be executed on user workstations, it also offers several optimizations for efficient cloud-based cluster execution. We compare the Metagenomics-Toolkit to five commonly used metagenomics workflows and demonstrate its capabilities by executing it on 757 metagenome datasets from sewage samples to investigate a possible sewage core microbiome. The Metagenomics-Toolkit is open source and available at https://github.com/metagenomics/metagenomics-tk.
