Tourmaline: a containerized workflow for rapid and iterable amplicon sequence analysis using QIIME 2 and Snakemake

Abstract

Background

Amplicon sequencing (metabarcoding) is a common method to survey diversity of environmental communities whereby a single genetic locus is amplified and sequenced from the DNA of whole or partial organisms, organismal traces (e.g., skin, mucus, feces), or microbes in an environmental sample. Several software packages exist for analyzing amplicon data, among which QIIME 2 has emerged as a popular option because of its broad functionality, plugin architecture, provenance tracking, and interactive visualizations. However, each new analysis requires the user to keep track of input and output file names, parameters, and commands; this lack of automation and standardization is inefficient and creates barriers to meta-analysis and sharing of results.

Findings

We developed Tourmaline, a Python-based workflow that implements QIIME 2 and is built using the Snakemake workflow management system. Starting from a configuration file that defines parameters and input files—a reference database, a sample metadata file, and a manifest or archive of FASTQ sequences—it uses QIIME 2 to run either the DADA2 or Deblur denoising algorithm, assigns taxonomy to the resulting representative sequences, performs analyses of taxonomic, alpha, and beta diversity, and generates an HTML report summarizing and linking to the output files. Features include support for multiple cores, automatic determination of trimming parameters using quality scores, representative sequence filtering (taxonomy, length, abundance, prevalence, or ID), support for multiple taxonomic classification and sequence alignment methods, outlier detection, and automated initialization of a new analysis using previous settings. The workflow runs natively on Linux and macOS or via a Docker container. We ran Tourmaline on a 16S rRNA amplicon dataset from Lake Erie surface water, showing its utility for parameter optimization and the ability to easily view interactive visualizations through the HTML report, QIIME 2 viewer, and R- and Python-based Jupyter notebooks.

Conclusions

Automated workflows like Tourmaline enable rapid analysis of environmental and biomedical amplicon data, decreasing the time from data generation to actionable results. Tourmaline is available for download at github.com/aomlomics/tourmaline.

Article activity feed

  1. Background

    **Reviewer 2. Haris Zafeiropoulos**

    I appreciated the opportunity to review your manuscript. Tourmaline aims to provide an easy-to-follow architecture for tracking the input and output file names, parameters, and commands of QIIME2 runs in order to enhance meta-analyses. If I am not mistaken, this is the cornerstone of this study, so my review is based on that. Running Tourmaline is straightforward and its documentation is exceptional. The video tutorial and the GitHub wiki (https://github.com/aomlomics/tourmaline/wiki) allow inexperienced users to start their analysis, and the containerized version of the tool allows easy installation on multiple operating systems without extra effort. The extra visual components provide insight in a nice way, and the report returned can add value to the runs. However, even though I share the authors' interest in usability and interoperability, and such tools could indeed have a great impact in the community, Tourmaline currently lacks the substantial features needed to be considered a stand-alone software tool. In addition, there are several issues that I believe need to be addressed (see the following list).

    Major issues

    Major Issue #1: The authors claim that "this lack of automation and standardization [in tracking input and output file names, parameters, and commands on QIIME2] is inefficient and creates barriers to meta-analysis and sharing of results." Therefore, what Tourmaline, and thus the manuscript, needs to demonstrate is that meta-analyses are now feasible to a greater extent thanks to the Tourmaline wrapper.

    Major Issue #2: Assuming that enhancing meta-analyses is the main contribution of Tourmaline, it is fundamental to consider the minimum information about a marker gene sequence (MIMARKS) standard of the Genomic Standards Consortium (GSC). Rather than just mentioning MIMARKS, Tourmaline needs to explore ways to exploit such standards, e.g., perhaps by adding MIMARKS columns in the config file.
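The suggestion in Major Issue #2 could be approached in several ways. As a minimal sketch, assuming the sample metadata is a tab-separated file with a header row, a check for MIMARKS-style columns might look like the following. The field list is an illustrative subset of MIxS term names chosen for this example; it is not the full checklist, and the function name is hypothetical rather than part of Tourmaline or QIIME2.

```python
import csv
import io

# Illustrative subset of MIxS/MIMARKS-style field names (assumption for this
# sketch; the real checklist is longer and depends on the sample type).
ASSUMED_MIMARKS_FIELDS = [
    "collection_date",
    "lat_lon",
    "geo_loc_name",
    "env_broad_scale",
    "env_local_scale",
    "env_medium",
    "target_gene",
    "seq_meth",
]


def missing_mimarks_fields(metadata_tsv, required=ASSUMED_MIMARKS_FIELDS):
    """Return the required field names that are absent from the TSV header row."""
    reader = csv.reader(io.StringIO(metadata_tsv), delimiter="\t")
    header = next(reader, [])  # an empty input yields an empty header
    present = {name.strip() for name in header}
    return [field for field in required if field not in present]


# Hypothetical two-line metadata table used only to exercise the function.
example = (
    "sample-id\tcollection_date\tlat_lon\ttarget_gene\n"
    "s1\t2019-08-01\t41.7 N 83.3 W\t16S rRNA\n"
)
print(missing_mimarks_fields(example))
```

A check of this kind only confirms that columns exist; validating the values against the standard's formats would need additional rules.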
    Major Issue #3: As QIIME2 has been developed on the basis of a plugin architecture, it would be highly recommended that such an application be provided as a plugin too, joining the corresponding QIIME2 library (https://library.qiime2.org/plugins/).

    Major Issue #4: With respect to the structure of the manuscript, I believe that there are sections that should be omitted. Tutorials and "how to" sections are extremely valuable, but it would be better to provide them either as supplementary material or through repositories (e.g., GitHub wiki, GitHub pages) rather than in the main manuscript. The wiki page on Tourmaline's GitHub repository is rather informative. An alternative might be merging the "Overview" with the "Snakefile", "Config file", "Input files" and "Run the workflow" sections, to describe "The Tourmaline workflow" architecture in a less verbose way, highlighting the role of the "Snakefile" and the "config.yml" files and the architecture that binds them together. The "Documentation", "Installation", and "Cloning" subsections could/should be omitted too.

    Major Issue #5: The test dataset does not allow the validation of Tourmaline in meta-analyses. It is rather important to have a testbed dataset to demonstrate "how to run", but a use case of an actual meta-analysis is required to demonstrate how different analyses can be combined in the framework of Tourmaline and provide further insight than those of the initial ones.

    Major Issue #7: No license has been included in the "Availability of supporting source code" section.
    On Tourmaline's GitHub repo a license (https://github.com/aomlomics/tourmaline#license) is mentioned, yet GigaScience asks for an appropriate Open Source Initiative compliant license (https://opensource.org/licenses/category). In addition, I tried to find whether the QIIME2 license is mentioned in the Tourmaline Docker container and I could not; if I am not mistaken, that is required by the QIIME2 license (https://github.com/qiime2/qiime2/blob/master/LICENSE). My apologies again in case of any misapprehension.

    Major Issue #8: Parameter optimization is indeed one of the greatest challenges in metabarcoding bioinformatics analyses. However, it is not clear to me how, by keeping the exact same names for your output files, you will be able to compare the results of the different runs.

    Major Issue #9: I realise that the authors provide Figures 2 and 3 in a complementary way, presenting the visual component returned after each step. However, having a figure with 16 screenshots makes it hard for the reader to tell what comes from QIIME2 and what from Tourmaline and, most importantly, does not highlight the added value that Tourmaline provides to such an analysis. I believe that Figure 2 could remain as is, while Figure 3 should focus on the output components that are provided not by QIIME2 routines but by Tourmaline functionalities. In the case of a meta-analysis, this figure should highlight all the added value that using QIIME2 through the Tourmaline wrapper would provide.

    Minor issues

    Minor Issue #1: Please rephrase the Findings section in the abstract, so that it is clear that Tourmaline invokes QIIME2 routines to implement taxonomy assignment, perform analyses, etc. The manuscript needs to state clearly throughout what QIIME2 does and what the extra features of Tourmaline are.

    Minor Issue #2: The conclusion you mention in the abstract is not in line with the scope of Tourmaline that was described earlier.
    Tourmaline does not accelerate the performance of QIIME2 routines. Its aim, as mentioned earlier, is to enhance meta-analysis and sharing of results.

    Minor Issue #3: Terms such as "meta-analysis", "reproducibility", and "metadata" could be added.

    Minor Issue #4: In the line "Information gained... resource management", it would be nice to add references for the value of the method in each of the various fields mentioned.

    Minor Issue #5: Usually, (shotgun) metagenome analyses are used to measure diversity in microbiomes, meaning the functional, genomic diversity; the term microbiome has been widely used for the collection of genomes from all microbial taxa present in a sample. It would be better to rephrase this as, for example, "popular method of measuring taxonomic/microbial diversity of host microbiome or in environmental samples".

    Minor Issue #6: As an overall comment, long sentences make the manuscript hard to read. In this case, "PCR primers have been used to generate amplicons of the bacterial 16S rRNA gene in studies of human and animal microbiota [..] among others." should be split.

    Minor Issue #7: "other environmental surveys": please explain.

    Minor Issue #8: It is not clear to me how the study of Prodan et al. (2020) is related to the standardization of amplicon data analysis.

    Minor Issue #9: The authors highlight that the standard directory structure enhances data exploration and parameter optimization. A use case to demonstrate this main feature of Tourmaline would be of high value.

    Minor Issue #10: "The purpose of performing amplicon sequencing or metabarcoding is to reveal patterns of diversity in biological systems." That is not the only purpose; please rephrase.

    Sincerely, Haris Zafeiropoulos

    Re-review:

    Most of my initial comments have been addressed to some extent. However, I believe that there is a major contradiction in this manuscript. As described in the introduction, Tourmaline is supposed to address challenges that make meta-analysis hard for metabarcoding studies: "this lack of automation and standardization is inefficient and creates barriers to meta-analysis and sharing of results". Yet, of the 5 points that the authors highlight in their response as setting Tourmaline apart from other amplicon workflows, only one ("3- Snakemake features") is related, to some extent, to the challenge described. The rest are exceptional ways to make it easier for users to run an analysis, but they have no direct link with how to enable or support meta-analysis. Therefore, I believe that either the Introduction section should be revised to better present the actual highlights of Tourmaline, or further features (some of them described in my initial review) need to be added to support meta-analysis.

    Other Issues

    Even though the authors recognize the impact of metadata standards, they do not mention anything in their manuscript about them and their potential. I was also not able to figure out how they "have made the metadata that comes with Tourmaline fully MIMARKS-compliant." If this software is focused on meta-analysis, I would strongly suggest investing more effort in describing how these standards could benefit the community and the Tourmaline users.

    In the parallelization section that was added, it is fundamental to mention that this is possible thanks to the Qiime2 implementation. Snakemake works as an interface that allows Tourmaline to support the options of Qiime2. If Qiime2 had no option for running on multiple threads, then Tourmaline would not inherit such a feature. The same applies to the merging step in the meta-analysis case.

    All Qiime2 commands used in this way need to be clearly identified as Qiime2 commands that are performed in the Tourmaline workflow; otherwise it could be thought that the feature was developed by the Tourmaline team.
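The point about inherited parallelization can be illustrated with a small sketch. Assuming a wrapper that assembles a `qiime dada2 denoise-paired` call, the thread count is simply forwarded to QIIME2's own `--p-n-threads` option and the wrapper does no parallel work itself. The function name, file names, and truncation lengths below are hypothetical and are not taken from Tourmaline's Snakefile.

```python
import shlex


def build_denoise_command(demux_qza, out_dir, threads, trunc_len_f, trunc_len_r):
    """Assemble a `qiime dada2 denoise-paired` call as an argument list.

    `threads` is only passed through to QIIME2's --p-n-threads option;
    any multithreading happens inside the QIIME2/DADA2 call itself.
    """
    return [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", demux_qza,
        "--p-trunc-len-f", str(trunc_len_f),
        "--p-trunc-len-r", str(trunc_len_r),
        "--p-n-threads", str(threads),
        "--o-table", f"{out_dir}/table.qza",
        "--o-representative-sequences", f"{out_dir}/repseqs.qza",
        "--o-denoising-stats", f"{out_dir}/stats.qza",
    ]


# Hypothetical inputs; the command is printed, not executed.
cmd = build_denoise_command(
    "demux.qza", "out", threads=8, trunc_len_f=240, trunc_len_r=190
)
print(shlex.join(cmd))
```

In a workflow manager, the value passed as `threads` would typically come from the rule's declared thread count, which is why the feature depends on the wrapped tool exposing such an option.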

  2. Abstract

    This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac066), and the reviews have been published under the same license. These are as follows.

    Reviewer 1. Anna Heintz-Buschart

    Reviewer Comments to Author: Thompson et al. present a workflow for amplicon sequencing analysis, wrapping commands from the commonly used QIIME2 package in the commonly used workflow manager Snakemake. The manuscript is clearly structured and contains figures of appropriate quality. Building and testing user-friendly workflows that facilitate the use of existing software is an important task for the research community. The chosen existing software (namely Snakemake as workflow manager and QIIME2's calls to DADA2 and deblur, as well as QIIME2's visualisations) is trusted and often used by the community. In addition to the manuscript, I have inspected the GitHub/wiki page of the proposed workflow and tested it on the provided test data as well as an independent data set. I have run into some issues, which I will put forward below, together with some comments on the content of and omissions from the manuscript.

    Manuscript:

    1. Overall, the manuscript reads more like a manual or tutorial than a methods description. The point-by-point description of the outputs may be a bit lengthy.
    2. The manuscript is missing information on runtimes and hardware requirements. This is in particular a pity because the workflow does not make use of parallelisation of the called tools. It might be pretty slow on large datasets?
    3. There is also no justification for the choices of the analyses that are done and the defaults that were chosen.
    4. In the introduction, other published amplicon sequencing workflows are cited and dismissed as not all well documented. Other than maybe not being so well documented, there are differences in scope between these workflows and the one described here. It would be very helpful for readers to be informed on how the described workflow is set apart from those workflows. And also from QIIME2 (all of the images in figure 3, for example, are QIIME2's visualisation work and not part of the workflow's report). Finally, tagseq, which also wraps QIIME2 commands in snakemake, is not mentioned. From my point of view, the workflow still requires the user to be able to do quite a lot of data setup in the command line environment and requires knowledge of QIIME2, while resolving relatively little by wrapping the commands in Snakemake (also see my next point). Clearly, it would be helpful to discuss what existing problem the workflow overcomes that the others (and QIIME2) don't.
    5. In the same paragraph of the introduction, it is mentioned that the workflow might evolve with QIIME2. However, it makes use of only a small part of QIIME's options/commands - is there a plan to widen the scope? How will continuous support be done? Is there a plan to integrate the workflow into the QIIME software ecosystem? Similarly, the workflow is not using Snakemake to its full potential, e.g. it does require several manual installation steps instead of making use of Snakemake's conda integration; it doesn't make use of Snakemake's reporting ability, which might be interesting together with QIIME2's data provenance. So, is there a plan to improve this? Also for developing it towards better usability on (cloud) cluster structures?

    Testruns:

    What worked:

    Overall, I could install the software on a Linux machine by following the description on the GitHub webpage. The test run worked as expected. The data was accessible and I could visualise it using the QIIME2 online viewer. I could run the workflow on an unrelated dataset.

    What could be improved:

    a) The setup of the input was a bit annoying, because the names and paths to the inputs and outputs need to be set at various places. The fact that existing and non-existing inputs have to be defined in the config confused me at first. The error messages that ensued from not doing this right were uninformative (these cases could be caught by the Snakefile with or without the help of a schema).

    b) While the workflow is very well documented, the settings for the individual denoising / taxonomy steps are not. The links to the QIIME2 documentation don't point to the current version.
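Regarding point a), a pre-flight check of the configuration could produce clearer error messages before any rule runs. The following is a minimal sketch under stated assumptions: the config is already loaded as a dictionary, and the key names and paths shown (`metadata_file`, `manifest_file`, `00-data/metadata.tsv`) are hypothetical placeholders, not Tourmaline's actual config keys.

```python
from pathlib import Path


def check_config_inputs(config, required_keys):
    """Return a list of human-readable problems; an empty list means all inputs look usable."""
    problems = []
    for key in required_keys:
        value = config.get(key)
        if not value:
            # Covers a key that is absent, None, or an empty string.
            problems.append(f"config key '{key}' is missing or empty")
        elif not Path(value).exists():
            problems.append(
                f"config key '{key}' points to '{value}', which does not exist"
            )
    return problems


# Hypothetical config used only to show the messages.
example_config = {"metadata_file": "00-data/metadata.tsv", "manifest_file": ""}
for problem in check_config_inputs(example_config, ["metadata_file", "manifest_file"]):
    print("ERROR:", problem)
```

Collecting all problems before reporting, rather than stopping at the first one, lets a user fix the whole config in a single pass. A schema-based validator would be an alternative to this hand-written check.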

    What didn't work:

    Running a small dataset: the workflow expects to be able to do statistical tests with groups and replicates. However, only a late step checks whether the data set is suitable, so there's a failure after considerable running time, which is annoying. While this kind of analysis may be the most common application, it's not the only one. It would be good if those parts of the workflow that require certain dataset structures could be switched off.
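An early check of the metadata could decide up front whether group-based statistics are feasible, so that those steps are skipped or the run fails fast. The sketch below assumes a tab-separated metadata table and uses illustrative thresholds (at least two groups with at least two replicates each); the function name and thresholds are assumptions and do not reflect the requirements of any specific QIIME2 test.

```python
import csv
import io
from collections import Counter


def groups_support_statistics(metadata_tsv, column, min_groups=2, min_replicates=2):
    """Return True if `column` has enough replicated groups for group comparisons."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    # Count samples per group, ignoring rows where the column is empty or absent.
    counts = Counter(row[column] for row in reader if row.get(column))
    replicated = [group for group, n in counts.items() if n >= min_replicates]
    return len(replicated) >= min_groups


# Hypothetical metadata: two groups with two samples each.
suitable = (
    "sample-id\tregion\n"
    "s1\twest\ns2\twest\ns3\teast\ns4\teast\n"
)
# Hypothetical metadata: every sample is in its own group.
unsuitable = "sample-id\tregion\ns1\twest\ns2\teast\n"

print(groups_support_statistics(suitable, "region"))
print(groups_support_statistics(unsuitable, "region"))
```

The boolean result could then gate the statistical steps, for example by excluding their outputs from the list of requested targets.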

    Minor:

    i) As a very irregular user of QIIME2, I find the QIIME2 jargon difficult to understand (e.g. artefacts and artefact equivalents, manifest, and the QIIME2 names of the DADA2 and deblur steps, emperor plot...). It would be better if these were defined (and maybe not all discussed in detail).

    ii) Personally, I would like to have a primer removal step in the workflow. But that's a design decision that can be discussed.

    iii) "the fungal internal transcribed spacer (ITS) of the rRNA gene (Abarenkov et al. 2010)": the internal transcribed spacer is not within an rRNA gene; rather, the different ITS regions are found between rRNA genes.