Importance of database curation in taxonomic assignation of 16S data.

Abstract

Microbial identification is a key component of microbial community analysis. Since the mid-2000s, with the advent of next-generation sequencing techniques, it has been necessary to use increasingly refined and complete databases to uniquely assign the taxonomy of each sequence or taxonomic unit. In this study we evaluate the relevance of database curation in this assignment process.

Article activity feed

  1. The reviewers have highlighted major concerns with the work presented. Please ensure that you address their comments. Please deposit the data underlying the work in the Society’s data repository Figshare account here: https://microbiology.figshare.com/submit. Please also cite this data in the Data Summary of the main manuscript and list it as a unique reference in the References section. When you resubmit your article, the Editorial staff will post this data publicly on Figshare and add the DOI to the Data Summary section where you have cited it. This data will be viewable on the Figshare website with a link to the preprint and vice versa, allowing for greater discovery of your work, and the unique DOI of the data means it can be cited independently. Please provide more detail in the Methods section and ensure that software is consistently cited and its version and parameters included.

  2. Comments to Author

    I appreciate the author's efforts toward the accurate taxonomic assignment of 16S rRNA data. The current version of the manuscript is poorly written, and no research on the topic was found. A similar study (PMID: 30602085) reports that EzBioCloud performs well compared with other existing databases. However, the study did not use EzBioCloud data to create the curated database named WellMicro. Why only the V3-V4 regions? As several metagenome datasets are publicly available, why did the study use whole-genome data to create a mock dataset? The manuscript needs more elaboration on current measures in metagenome data analytics and on the accuracy of WellMicro at the genus and species levels.

    Please rate the manuscript for methodological rigour

    Poor

    Please rate the quality of the presentation and structure of the manuscript

    Poor

    To what extent are the conclusions supported by the data?

    Not at all

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    No: Ethical clearance not applicable to the study

  3. Comments to Author

    The authors are addressing a key issue in the microbiome field, as many of the 16S databases are limited in their curation. However, there are a few issues with the work that must be addressed before publication can be endorsed.

    Major issues:

    1. The authors do not make their database available. Looking at the linked GitHub repository, the script for reproducing the results is good, but it requires all the databases, which are not present in the repository, and no additional links are provided. Without the WMdb being made available, this paper provides no benefit to the community. If this database is closed to the community, then publication cannot be endorsed, as the results cannot be validated.

    2. The creation of the WMdb is unclear in parts. For example, the different databases use different lineage systems. How were these combined? This is a major issue in the field, and one that will be compounded by the creation of the SeqCode. How do you determine which taxonomy is correct?

    3. The mock data provided by the authors in the GitHub repository are full-length sequences. I assume these are those that were 'extracted from complete bacterial genomes randomly downloaded from NCBI'. But why are the V3-V4 regions not also provided? Where are the random subsets of the fragments that were included with artificially generated DNA sequences? I must say I am not impressed by the lack of data provided by the authors in this regard. Using the data provided by the authors, it would be impossible to replicate this study. However, I do think the idea of the mock communities is good. One issue, though, is the potential bias of the WMdb: it may have been optimised to work on these mock communities, with the authors having put additional work into ensuring these taxa are covered. As such, real-life use cases are needed to validate the results shown in Figure 1. I would suggest analysis of the HMP 16S datasets, along with a terrestrial dataset, such as the TARA data. This would allow the applicability of the WMdb to be assessed in a real-world setting.

    Minor issues:

    1. Greengenes has recently been updated, and I would suggest including both the old and new versions in the analysis.

    2. Replace 'L1' etc. with the taxonomic level, e.g. phylum, in the supplementary figures.

    Please rate the manuscript for methodological rigour

    Satisfactory

    Please rate the quality of the presentation and structure of the manuscript

    Good

    To what extent are the conclusions supported by the data?

    Partially support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes