Contamination detection and microbiome exploration with GRIMER

Abstract

Background

Contamination detection is an important step that should be carefully considered in the early stages of designing and performing microbiome studies to avoid biased outcomes. Detecting and removing true contaminants is challenging, especially in low-biomass samples or in studies lacking proper controls. Interactive visualizations and analysis platforms are crucial to guide this step and to help identify noisy patterns that could potentially be contamination. Additionally, external evidence, such as the aggregation of several contamination detection methods and the use of common contaminants reported in the literature, could help to discover and mitigate contamination.

Results

We propose GRIMER, a tool that performs automated analyses and generates a portable and interactive dashboard integrating annotation, taxonomy and metadata. It unifies several sources of evidence to help detect contamination. GRIMER is independent of quantification methods and directly analyses contingency tables to create an interactive and offline report. Reports can be created in seconds and are accessible to non-specialists, providing an intuitive set of charts to explore data distribution among observations and samples and its connections with external sources. Furthermore, we compiled and used an extensive list of possible external contaminant taxa and common contaminants with 210 genera and 627 species reported in 22 published articles.

Conclusion

GRIMER enables visual data exploration and analysis, supporting contamination detection in microbiome studies. The tool and data presented are open-source and available at: https://gitlab.com/dacs-hpi/grimer .

Article activity feed

  1. Background

    Reviewer2-Raphael Eisenhofer

    Piro and Renard introduce GRIMER, a tool that automates microbiome-related analyses and creates a rich, offline-supported report that can be shared with collaborators or hosted online. I think that they gave a great summary of the problem of contamination in the microbiome field, and clearly explain the gap that their software fills. They exhibit GRIMER on previously published datasets, which are available to view online. Overall, I'm very impressed with the dashboard: it looks great, makes it easy to explore datasets, and is highly portable. I can certainly see myself using GRIMER on some of my future datasets, and I have no doubt that it can be a valuable tool for others in the field. I do however think that the documentation and usability of the tool can be improved, and I give some suggestions below. Addressing these issues will, in my opinion, lead to a wider adoption of the tool by researchers in the field.

    Usability:

    I managed to test GRIMER on a 16S amplicon dataset, but given the sparsity of the documentation, this took me a little longer than expected (in addition to quite a few steps), and I think that there are improvements that could be made to make it easier for people to use GRIMER from formats that people commonly generate.

    For example, QIIME2 is perhaps the most used 16S amplicon analysis pipeline, so the ability to import directly from .qza files (e.g. table.qza, taxonomy.qza) would give GRIMER much greater reach. If this is beyond the scope to incorporate within the GRIMER codebase, at least provide the exact code needed in the documentation for people to export their .qza files to files compatible with GRIMER.

    Likewise for phyloseq, a commonly used R package for microbiome analyses. Could some documentation/code be added about how best to export phyloseq objects to a format that GRIMER can handle?

    I mostly analyse shotgun metagenomic datasets (genome-resolved), and I foresee more users using these types of data in the future. Therefore, the ability to parse gtdb-tk outputs directly would be very helpful. Perhaps have a flag --gtdb that parses the 'gtdbtk.bac120.summary.tsv' and 'gtdbtk.ar53.summary.tsv' files.

    Following on from this, CoverM (https://github.com/wwood/CoverM) is quite commonly used for generating final MAG count tables (.tsv), so the ability to import them directly would be a really nice quality-of-life addition, and something that would not require much coding to accomplish.

    I believe that these adjustments will make the tool far more accessible for everyday users and increase the adoption of GRIMER by the wider community.

    For the actual report, if possible, I would like the ability to export ASVs/features/MAGs from the report that the user thinks are contaminants. This could be a list that the user could copy/paste, or the direct export of a .txt/.tsv. Perhaps the user could tick a box next to the ASVs/features/MAGs to save them to a list/viewer? The reason for this is that the logical next step I see after using GRIMER is to go back to your dataset and filter out the putative contaminant ASVs/features/MAGs. Being able to produce such a list will make subsequent filtering by the user easier.

    I couldn't get decontam to work with my dataset; here was the error:

    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
    KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n nan, nan, nan, nan],\n dtype='float64')] are in the [index]"

    I can post this as an issue on the repo if you'd like.

    Regarding the specification of negative and positive controls in the config.yaml, would it be possible for this to be implemented from the executable? For example, there could be a flag '--control-column' that specifies the column in the user's metadata file. '--control-column control' would parse the 'control' metadata column and, where values are 'negative' or 'positive', assign them automatically. This is just a suggestion that could make it a bit easier for users to set control samples, rather than having to create a new .txt file and change the config.yml.

    Dependencies:

    When installing via conda, I ran into the following error:

    ImportError: cannot import name 'PearsonRConstantInputWarning' from 'scipy.stats'

    It seems that this can't be imported from later versions of scipy, but I managed to fix it by forcing scipy=1.8.1. You should be able to force this version in the conda recipe.

    Minor grammar:

    Line 16: replace 'perform' with 'performs'
    Line 50: 'found in the [9]'
    Line 56: replace 'as technicians body' with 'microbes from laboratory technicians'
    Line 60: I would remove the 'environmental' adjective here, as contamination affects all low-biomass samples.
    Line 63: I would use 'samples' in place of 'environments' here. You may also consider suggesting that some samples may even contain no microbial DNA. E.g. replace 'low amounts of' with 'little to no'.
    Line 64: Replace 'ideal scenario for an exogenous contaminants' with 'an ideal scenario for exogenous contaminants'.
    Line 72: perhaps consider referencing decontam here.
    Line 79: replace 'due to increase in costs' with 'due to the increase in cost associated with their inclusion'.
    Line 81: Consider referencing first author's last name, e.g. 'Moreover, XXX et al. [45] reported…'
    Line 88: remove 'outcomes'
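
    The '--control-column' suggestion in the review above can be sketched in a few lines of Python. This is only an illustration of the requested behaviour, not GRIMER code: the function name `controls_from_metadata`, the tab-separated layout, and the assumption that sample IDs sit in the first metadata column are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def controls_from_metadata(handle, control_column, sample_column=None,
                           labels=("negative", "positive")):
    """Group sample IDs by the control label found in one metadata column."""
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames or []
    if control_column not in fields:
        raise KeyError(f"column {control_column!r} not found in metadata")
    # Assumption: sample IDs live in the first column unless told otherwise.
    sample_column = sample_column or fields[0]
    groups = defaultdict(list)
    for row in reader:
        # Empty or missing cells mean "not a control" and are skipped.
        label = (row[control_column] or "").strip().lower()
        if label in labels:
            groups[label].append(row[sample_column])
    return dict(groups)


# Hypothetical metadata table with a 'control' column.
example = io.StringIO(
    "sample_id\tcontrol\tsite\n"
    "S1\t\tgut\n"
    "S2\tnegative\tblank\n"
    "S3\tPositive\tmock\n"
    "S4\tnegative\tblank\n"
)
print(controls_from_metadata(example, "control"))
```

    The resulting groups could then be written to one sample list per control type, which is the step the reviewer currently performs by hand before editing the configuration file.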

  2. Abstract

    Reviewer1-Gavin M Douglas

    Piro and Renard present GRIMER, which is a bioinformatics tool for summarizing microbiome taxonomic data in various ways, with the main purpose of identifying putatively contaminant taxa. The authors convincingly argue that there is great value in looking at several different aspects of a dataset when determining which taxa are potential contaminants. I think this tool could potentially be very useful for the field, but I think at the moment there are several places where users might be confused and perhaps be overwhelmed without more documentation.

    The main point of confusion I'm concerned about is regarding the "common contaminants". It's not convincing that you can just classify a taxon as a contaminant regardless of what environment is being profiled. Also, under this approach, if a taxon is identified once as a contaminant in an earlier study, would it then be classified as a contaminant in all datasets processed by GRIMER? This would mean that a lot of high-abundance taxa in certain environments would be wrongly thrown out. For instance, you can imagine high-abundance taxa on the human skin might be more likely to be contaminants during sequencing preparation, but of course many researchers are very interested in profiling the skin microbiome. I think the authors realize this, but I'm concerned that typical users may not appreciate this point. I think explicit discussion of this point in the discussion is needed, and an example of how this might look in practice (e.g., skin microbiome samples input to GRIMER, as part of a larger tutorial that could be online [see next point]) would help avoid this mistake.

    The authors do a great job of walking through some results in the text, but more documentation is needed for the reports. The authors should include a basic tutorial that provides example input files and then walks through each individual tab. This could be done all through text with screenshots of GRIMER, or perhaps with a video tutorial.
    In addition, for someone just opening the example reports, I'm sure they will be wondering what data was produced by GRIMER (e.g., they might wrongly think GRIMER did the taxonomic classification) and what data was needed as input.

    The authors should expand on how the correlation step is used to identify contaminants. There is great interest in identifying clusters of co-occurring taxa, so identifying a cluster of 9 genera in Figure 5 doesn't seem like evidence of contamination to me. Perhaps it is when considered with other lines of evidence though, but this should be made clearer. Currently this legend implies that it alone points to reagent-derived contamination.

    The figure text needs to be increased in size. Using more panels split across additional rows and removing unnecessary info (e.g., not all control categories need to be shown in Figure 1) would make these figures easier to interpret. I realize that you were hoping to use the raw GRIMER figures, but based on the current display items it does not seem like they are publication ready.

    The acronym WGS generally refers to "whole genome sequencing" (i.e., for single isolate organisms) not "whole metagenome sequencing". The standard acronym for the latter case would be "MGS", for "metagenomics". Also, the term "shotgun metagenomics sequencing" is most commonly used in this context; I've never come across "whole metagenome sequencing" before. Either way, "WGS" will mislead casual readers with the current usage, so this should be changed on your website and in the manuscript.

    The taxa parsing capabilities sound like they will save a lot of tedious, manual data mapping! Just checking - how does it perform with new taxa names / typos?

    Text edits

    L11 - "are challenging task" should be "is challenging"
    L12 - can remove "by design"
    L12 - "helping to" should be "to help"
    L13 - "can potentially be a source" I think should be "that could reflect"
    L14 - "evidences" should be "evidence"
    L13-L14 - Unclear what is meant by "external evidences, aggregation of methods and data and common contaminant" - should be clarified
    L15 - "that perform" should be "that performs"
    L17 - "towards contamination detection" should be something like "to help detect contamination"
    L41 - "hypothesis" should be "hypotheses"
    L42/43 - "analysis can hardly be fully" should be something like "the required analysis is difficult to fully…"
    L56 - "technicians body" should be "a technician's body"
    L60 - "strongly affects environmental" should be "especially environmental," (note comma)
    L64 - "ideal scenario for an" should be "an ideal scenario for"
    L67 - "not to bias measurements and not to" should be reworded, possibly as: "to not bias measurements and to ensure that bias is not propagated into databases"
    L75 - "were proposed. They are " should be "have been proposed. These are"
    L77 - "among others" should be ", and others" (note comma)
    L79 - "increase in costs" should be "the required increase in costs"
    L88 - add "a" before focus
    L90, L196, L265, and elsewhere - "evidences" should be "evidence"
    L99, L104, L117, and possibly elsewhere - "analysis" should be "analyses" (when plural)
    L106 - "each samples/compositions" should be "each sample/composition"
    L110 - add "a" before taxonomy database and "the" before "DNA concentration"
    L132 - "specially" should be "especially"
    L134 - remove "a" before "the"
    L151 - add "of" after "thousands"
    L182 - "is" should be "are"
    L196 - "evidences" should be "evidence". And rather than "Evidences towards" it would be correct to say "Evidence for" or "Evidence supporting"
    L208 - add "the" before "overall"
    L246/247 - "generated several studies and investigations" should be something like "motivated several investigations"
    L248 - should be something like "from the maternal and fetal sides"
    L279 - remove "a"
    L280 - Add "the" before "Jet"
    L284 - capitalize "Qiita" and re-word "Pick closedreference OTUs with 97% annotated with greengenes taxonomy"
    L293 - Should be "Furthermore" rather than "Further"
    L295 - I think it should be "with low and high human exposure, respectively"? Or do you mean they both have highly variable exposure?
    L297 - "could be a also an" should be "could be driven by an"
    L300 - "against" should be "and"
    L304 - "correlated genus" should be "correlated genera" (and in other cases, such as in the Fig 5 and 6 legends, where "genus" should be plural version, i.e., "genera")
    L305 - "Such pattern" should be "Such a pattern"
    L307 - Should be "groups" rather than "organisms groups", or just "genera" as I believe each is a genus
    L313 - Remove "a"
    Fig 5 legend: "point" should be "points"
    Fig 6 legend: "taxa is abundant" should be "This taxon is abundant" and "inversely correlate" should be "inversely correlated". "a contamination evidence" should be "potential contamination"