Mitigation and detection of putative microbial contaminant reads from long-read metagenomic datasets
Abstract
Metagenomic sequencing of clinical samples has significantly enhanced our understanding of microbial communities. However, microbial contamination and host-derived DNA remain major obstacles to accurate data interpretation. Here, we present a methodology called ‘Stop-Check-Go’ for detecting and mitigating contaminants in metagenomic datasets obtained from neonatal patient samples (nasal and rectal swabs). The method spans laboratory and bioinformatics work, combining a prevalence method, coverage estimation, and microbiological reports. We compared the ‘Stop-Check-Go’ decontamination system with other published decontamination tools, which commonly performed poorly when decontaminating samples from microbiologically negative patients (false positives). Host DNA decreased by an average of 76% per sample using a lysis method and was further reduced during post-sequencing analysis. Microbial species were classified as putative contaminants and assigned to ‘Stop’ in nearly 60% of the dataset. The ‘Stop-Check-Go’ system was developed to address the specific need of decontaminating low-biomass samples, where existing tools, primarily designed for short-read metagenomic data, showed limited performance.
Impact Statement
Metagenomics has gained popularity due to its diverse applications in multi-omics research and the improved sequencing performance of technologies such as Nanopore. However, biological interpretation remains challenging because of the complexity of the data structure and the potential for contamination at multiple steps during sample processing, which can lead to incorrect conclusions. We aim to raise awareness of contamination, which can be host-associated, cross-contamination, or library-derived, and which may be introduced at any stage from sample collection onward.
Existing decontamination tools are largely designed for short-read sequencing and thus present limitations when applied to long-read datasets. We propose directly comparing the species found in samples with those found in weekly negative controls, which progressively accumulate both external and kit-reagent contaminants. Additionally, we recommend incorporating read-depth coverage and read-prevalence metrics, particularly in studies involving low-biomass or non-culturable microorganisms. Whenever possible, validation with microbiological reports is strongly advised. Our code is available on GitHub and can be executed locally in RStudio. It outputs species classifications labeled ‘Stop’, ‘Check’, or ‘Go’, as well as BIOM-format files cleaned of identified contaminants, ready for downstream analysis with R packages such as phyloseq, vegan, or metagenomeSeq.
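The kind of decision logic described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the `SpeciesEvidence` record, the `classify` function, both cutoffs, and the example read counts are hypothetical, chosen only to show how a comparison against negative controls, coverage, and microbiological reports could be combined into a ‘Stop’, ‘Check’, or ‘Go’ label.

```python
from dataclasses import dataclass


@dataclass
class SpeciesEvidence:
    name: str
    sample_reads: int        # reads assigned to the species in the patient sample
    control_reads: int       # reads assigned to it in the matching negative control
    genome_coverage: float   # fraction of the reference genome covered (0.0 to 1.0)
    culture_confirmed: bool  # species reported by the microbiology laboratory


def classify(ev, ratio_cutoff=10.0, coverage_cutoff=0.5):
    """Return 'Stop', 'Check', or 'Go' using assumed, illustrative thresholds."""
    # Assumption: a positive microbiological report outweighs the other signals.
    if ev.culture_confirmed:
        return "Go"
    # Species absent from the control: keep if well covered, otherwise review.
    if ev.control_reads == 0:
        return "Go" if ev.genome_coverage >= coverage_cutoff else "Check"
    # Species also present in the control: require a clear excess in the sample.
    ratio = ev.sample_reads / ev.control_reads
    if ratio >= ratio_cutoff and ev.genome_coverage >= coverage_cutoff:
        return "Check"
    return "Stop"


# Invented example values, for illustration only (not data from the study).
examples = [
    SpeciesEvidence("Klebsiella pneumoniae", 5000, 0, 0.9, True),
    SpeciesEvidence("Cutibacterium acnes", 40, 35, 0.05, False),
    SpeciesEvidence("Staphylococcus epidermidis", 1200, 20, 0.7, False),
]

calls = {ev.name: classify(ev) for ev in examples}
for name, label in calls.items():
    print(f"{name}: {label}")
```

In this sketch, a species seen at similar depth in the sample and the negative control is assigned to ‘Stop’, while one that is enriched in the sample but also present in the control is flagged as ‘Check’ for manual review. The actual rules and thresholds should be taken from the published repository.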
Data summary
The complete source code and documentation are available from GitHub (https://github.com/SAM81221/Stop-Check-Go_TAPIR). Metagenomic sequences, including controls, have been deposited in the ENA under project PRJEB82667, and isolate sequences of control samples under PRJEB95992. Information on samples and sequences can be found in Supplementary Table S1.