BugSplit enables genome-resolved metagenomics through highly accurate taxonomic binning of metagenomic assemblies

Abstract

A large gap remains between sequencing a microbial community and characterizing all of the organisms within it. Here we develop a novel method to taxonomically bin metagenomic assemblies through alignment of contigs against a reference database. We show that this workflow, BugSplit, bins metagenome-assembled contigs to species with a 33% absolute improvement in F1-score when compared to alternative tools. We perform nanopore metagenomic next-generation sequencing (mNGS) on samples from patients with COVID-19 and, using a reference database predating COVID-19, demonstrate that BugSplit’s taxonomic binning enables sensitive and specific detection of a novel coronavirus that is not possible with other approaches. When applied to nanopore mNGS data from cases of Klebsiella pneumoniae and Neisseria gonorrhoeae infection, BugSplit’s taxonomic binning accurately separates pathogen sequences from those of the host and microbiota, and unlocks the possibility of sequence typing, in silico serotyping, and antimicrobial resistance prediction for each organism within a sample. BugSplit is available at https://bugseq.com/academic.

SciScore for 10.1101/2021.10.16.464647:

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Ethics	not detected.
Sex as a biological variable	not detected.
Randomization	As minimap2 randomly picks a primary alignment if there are multiple alignments with equal top score, we collapse equally good top hits to their lowest common ancestor.
Blinding	not detected.
Power Analysis	not detected.
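
The Randomization entry above describes collapsing equally good top hits to their lowest common ancestor. The following is a minimal sketch of that idea, not BugSplit's implementation: the toy taxonomy, the `(taxon, score)` hit format, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Toy taxonomy for illustration only: each taxon maps to its parent,
# and the root maps to itself.
PARENT: Dict[str, str] = {
    "root": "root",
    "Enterobacteriaceae": "root",
    "Klebsiella": "Enterobacteriaceae",
    "Escherichia": "Enterobacteriaceae",
    "Klebsiella pneumoniae": "Klebsiella",
    "Klebsiella oxytoca": "Klebsiella",
    "Escherichia coli": "Escherichia",
}


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from a taxon up to the root, taxon first."""
    path = [taxon]
    while parent[taxon] != taxon:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa: List[str], parent: Dict[str, str]) -> str:
    """Return the deepest taxon shared by the lineages of all given taxa."""
    if not taxa:
        raise ValueError("at least one taxon is required")
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        others = set(lineage(taxon, parent))
        # Keep leaf-to-root order so the first element stays the deepest.
        common = [t for t in common if t in others]
    return common[0]


def collapse_top_hits(hits: List[Tuple[str, float]], parent: Dict[str, str]) -> str:
    """Assign a contig to the LCA of all hits that tie for the best score."""
    best = max(score for _, score in hits)
    tied = [taxon for taxon, score in hits if score == best]
    return lowest_common_ancestor(tied, parent)
```

With hits `[("Klebsiella pneumoniae", 950), ("Klebsiella oxytoca", 950), ("Escherichia coli", 900)]`, the two tied top hits collapse to `Klebsiella`, so the assignment no longer depends on which of the tied alignments happens to be reported as primary.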

Table 2: Resources

Software and Algorithms
Sentences	Resources
A mash database44, published by the mash authors and comprising all genomes and plasmid sequences in Refseq (https://gembox.cbcb.umd.edu/mash/refseq.genomes%2Bplasmid.k21s1000.msh) is used for homology search with Homopolish.	Refseq suggested: (RefSeq, RRID:SCR_003496)
In brief, plasmid sequences are identified with PlasmidFinder58, and their taxonomic identities are overridden to that of “plasmid sequences” (NCBI taxon 36549).	PlasmidFinder58 suggested: None
MMseqs2 and DIAMOND were run with the NCBI non-redundant amino acid database as suggested by their authors.	DIAMOND suggested: (DIAMOND, RRID:SCR_009457)
These files were generated by converting the NCBI taxonomy files (names.dmp and nodes.dmp) provided with the CAMI datasets into Newick format with the Python taxonomy package59.	Python suggested: (IPython, RRID:SCR_001658)
Ground truths were generated by comparing each contig in our metagenomic assembly to the reference genome of each organism contained within the mock microbial community using MegaBLASTN.	MegaBLASTN suggested: None
The taxonomic identification of the top BLAST hit for each contig was determined to be its gold standard assignment.	BLAST suggested: (BLASTX, RRID:SCR_001653)
Binning completion and contamination were assessed with CheckM using the default CheckM database.	CheckM suggested: (CheckM, RRID:SCR_016646)
The NCBI nucleotide database from 2019 was downloaded from the second CAMI challenge (https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_blast/nt.gz) and used in place of BugSplit’s default database for the emerging coronavirus application.	BugSplit’s suggested: None
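
The ground-truth rows above assign each contig the taxon of its top BLAST hit. A minimal sketch of that step follows, assuming BLAST tabular output (`-outfmt 6`, where column 1 is the query, column 2 the subject, and column 12 the bit score) and assuming "top hit" means the highest bit score; the source does not state either. The sample rows, the subject-to-taxon mapping, and the function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Tuple

# Hypothetical rows in BLAST tabular format (12 tab-separated columns).
SAMPLE = (
    "contig_1\tref_A\t99.1\t5000\t40\t5\t1\t5000\t100\t5099\t0.0\t9100\n"
    "contig_1\tref_B\t92.0\t4800\t350\t34\t1\t4800\t10\t4809\t0.0\t6500\n"
    "contig_2\tref_B\t98.7\t3000\t35\t4\t1\t3000\t1\t3000\t0.0\t5400\n"
)

# Hypothetical mapping from reference sequence ID to organism.
SUBJECT_TO_TAXON: Dict[str, str] = {
    "ref_A": "Klebsiella pneumoniae",
    "ref_B": "Escherichia coli",
}


def top_hit_per_contig(blast_tsv: Iterable[str]) -> Dict[str, str]:
    """Return contig -> subject ID for the highest-bit-score hit of each contig."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        contig, subject, bitscore = row[0], row[1], float(row[11])
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject, bitscore)
    return {contig: subject for contig, (subject, _) in best.items()}


def gold_standard(blast_tsv: Iterable[str], subject_to_taxon: Dict[str, str]) -> Dict[str, str]:
    """Return contig -> taxon of its top hit."""
    top = top_hit_per_contig(blast_tsv)
    return {contig: subject_to_taxon[subject] for contig, subject in top.items()}


if __name__ == "__main__":
    for contig, taxon in gold_standard(io.StringIO(SAMPLE), SUBJECT_TO_TAXON).items():
        print(contig, taxon)
```

This sketch keeps only the first hit seen when bit scores tie; a real evaluation would need an explicit tie-breaking rule.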

Results from OddPub: Thank you for sharing your data.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

By incorporating graph topology and linkage of contigs, we will be able to mitigate this limitation and place the contig in multiple strain-level taxonomic bins. Further exploration of the parameter space of BugSplit may also result in improved binning. For example, minimap2 could be tuned for greater alignment recall than its default “map-ont” setting while preserving precision, and voting coverage thresholds could be tuned for improved classification of contigs across the taxonomic hierarchy. Ultimately, we expect to adopt a strategy that will allow optimal values for key parameters to be determined by the taxonomic lineage of alignments. BugSplit is a highly accurate tool for taxonomic binning and profiling of third-generation metagenomic data, with computing speeds faster than comparable workflows. We show that using BugSplit to bin metagenomic assemblies has several substantial downstream effects, including enabling discrimination and identification of highly similar species, novel species identification, and universal, pathogen-agnostic taxonomic profiling. When combined with automated assembly, polishing and post-processing of bins, we demonstrate that detecting pathogens, strain-typing them and accurately predicting their antimicrobial resistance directly from complex samples with mNGS becomes feasible.

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.
