The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics

Abstract

Background

Microbial culture collections play a key role in taxonomy by studying the diversity of their strains and providing well-characterized biological material to the scientific community for fundamental and applied research. These microbial resource centers thus need to implement new standards in species delineation, including whole-genome sequencing and phylogenomics. In this context, the genomic needs of the Belgian Coordinated Collections of Microorganisms were studied, resulting in the GEN-ERA toolbox. The latter is a unified cluster of bioinformatic workflows dedicated to both bacteria and small eukaryotes (e.g., yeasts).

Findings

This public toolbox allows researchers without a specific training in bioinformatics to perform robust phylogenomic analyses. Hence, it facilitates all steps from genome downloading and quality assessment, including genomic contamination estimation, to tree reconstruction. It also offers workflows for average nucleotide identity comparisons and metabolic modeling.

Technical details

Nextflow workflows are launched by a single command and are available on the GEN-ERA GitHub repository (https://github.com/Lcornet/GENERA). All the workflows are based on Singularity containers to increase reproducibility.

Testing

The toolbox was developed for a diversity of microorganisms, including bacteria and fungi. It was further tested on an empirical dataset of 18 (meta)genomes of early branching Cyanobacteria, providing the most up-to-date phylogenomic analysis of the Gloeobacterales order, the first group to diverge in the evolutionary tree of Cyanobacteria.

Conclusion

The GEN-ERA toolbox can be used to infer completely reproducible comparative genomic and metabolic analyses on prokaryotes and small eukaryotes. Although designed for routine bioinformatics of culture collections, it can also be used by all researchers interested in microbial taxonomy, as exemplified by our case study on Gloeobacterales.

Article activity feed

  1. Background

    Reviewer 2: Ben Woodcroft

    Cornet et al. have generated a collection of Nextflow pipelines to analyse genome or raw sequencing data from microbial organisms and protists. The methodology appears sound and reproducible. My main concern with the manuscript is that the toolbox is not well described in the abstract, introduction or GitHub repository. It isn't clear whether the analyses are specific to genomics questions arising from culture collections, or whether they are more broadly applicable. There is also no discussion of other pipelines that achieve similar things, e.g. ATLAS (https://metagenome-atlas.github.io/).

    I also had a number of minor concerns, detailed below.

    A number of grammatical errors detected, these should be fixed. Parts of the manuscript are also slightly too informal e.g. "This confirms the interest of using ORPER to spot interesting SSU rRNA sequences".

    It would be helpful if the GitHub front page could provide a concise description of what the software aims to achieve, to make its use more understandable.

    106: "as it happened" grammatical error.

    "Assembly.nf": Commonly assembly is a separate process to binning, but here binning has been included. Perhaps a clearer name might be Genome-recovery.nf?

    124: "Researchers interested in a better understanding of these tools can read the recent review on the detection of genomic contamination made by Cornet et al. [15]." While not inappropriate, this is perhaps too much self-citation. Why is contamination assessed but not completeness?

    129: "annotation of bacterial proteins is automatic" Automatic in what sense? Annotation also refers to describing the function of the protein usually, but here the meaning appears to be restricted to ORF calling. I found this somewhat confusing. Also "in the different GEN-ERA workflows" is unclear - does this mean that prodigal is run as part of the Assembly.nf workflow for instance?

    143: "Orthology.nf automatically provides the core genes, shared by all the organisms in unicopy" what is meant by "all organisms" here?

    145: "The OGs of proteins can be further enriched" what does "enriched" mean?

    163: GTDB.nf is described in the "Other workflows" section, when it is phylogeny-related.

    172: "it was technically not possible to include Mantis in a container" I am curious as to why this was the case? I do not have any specific insight or ability to judge the accuracy of this statement, just curious. Inclusion of a sentence describing the difficulties might help other workflow developers and/or the Mantis developers.

    190: "Gloeobacterales are the most basal order of the Cyanobacteria phylum" This statement is somewhat controversial, because the GTDB has defined the Melainobacteria as being a part of the Cyanobacteria phylum based on RED values. I would suggest removing "the most basal" or making it clear that cyanobacteria refers to photosynthetic cyanobacteria rather than the phylum.

    189: The methods for this section are not described in the methods section. They are only briefly described in the Findings section. A clearer link to these methods should be made from the maintext and methods.

    212: Showed -> show.

    215: "estimate the sequencing level of the order" it isn't clear what meaning this has.

    224: "Our results demonstrate the absence of one metabolic pathway" There are many metabolic pathways, presumably it is missing more than one.

    233: "examples of the practical usage of the GEN-ERA toolbox are available in Supplemental File 1." this does not make it clear that this refers to the methods for this specific example.

  2. Background Microbial culture collections play a key role in taxonomy by studying the diversity of their accessions and providing well characterized strains to the scientific community for fundamental and applied research.

    This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad022), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Shakuntala Baichoo

    Paper Title: The GEN-ERA toolbox: unified and reproducible workflows for research in microbial genomics

    The GEN-ERA toolbox provides a number of containerized workflows to researchers (without any specific training in bioinformatics) to study the diversity of well-characterized strains for fundamental and applied research. More specifically, it facilitates all steps from genome downloading and quality assessment, including genomic contamination estimation, to phylogenetic tree reconstruction. It additionally provides workflows for average nucleotide identity comparisons and metabolic modeling. The supplementary file provides details of how to run the whole workflow (through 10 steps), found in the GEN-ERA toolbox on basal, for an empirical dataset of early emerging cyanobacteria. It provides an up-to-date phylogenomic analysis of the Gloeobacterales order, the first group to diverge in the evolutionary tree of Cyanobacteria. The GitHub repo located at https://github.com/Lcornet/GENERA also provides more details about the GEN-ERA tool suite. Though the manuscript mentions that the call to Mantis could not be included in the Singularity call, the GitHub repo indicates that Mantis is now installed in a Singularity container for the Metabolic workflow (install is no longer necessary). The tool has been tested on an empirical dataset of 18 (meta)genomes of early-branching Cyanobacteria, and the time taken as well as the results of the run are documented in the supplementary file. The authors claim that the tool suite can be used to study the diversity of microorganisms, including bacteria and fungi. From the GitHub repo, it is clear that a number of publications in high-impact journals have already resulted from the development of GEN-ERA.

    1. Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? This study aims at describing a toolbox, named GEN-ERA, and the methods section defines the various steps of the tool suite. With the supplementary file and the GitHub repo at hand, it is easy to follow the manuscript. The versions of the programs used in the case study are provided in the form of Nextflow scripts.

    2. Are the conclusions adequately supported by the data shown? The results of running the tool suite on an empirical dataset of 18 (meta)genomes of early-branching Cyanobacteria, at each step, as well as the time taken to download the files and to run each step, are convincing evidence that it works fine, at least for Cyanobacteria. But this is found in the Supplementary Material. There should be a section on Discussion and Conclusion in the main text.

    3. Please indicate the quality of language in the manuscript. Does it require a heavy editing for language and clarity? The use of English language is adequate and concise, and can be understood clearly by researchers interested in studying the diversity of microorganisms.

    4. Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? The statistics involved in the phylogenetic analyses are integrated in the existing programs. Hence I am not able to assess the statistics.

    5. Final Comments The proposed toolbox/toolsuite described in this manuscript is very relevant and worth a read for researchers interested in studying the diversity of microorganisms, including bacteria and fungi, especially as it helps to facilitate their life through the use of well-defined containerized NextFlow workflows.

    I strongly believe that there should be a section on the Discussion of the results of running the toolbox for the case study and a Conclusion in the main manuscript. This will help readers in understanding the importance of the toolbox better.