BugSplit enables genome-resolved metagenomics through highly accurate taxonomic binning of metagenomic assemblies



Abstract

A large gap remains between sequencing a microbial community and characterizing all of the organisms within it. Here we develop a novel method to taxonomically bin metagenomic assemblies through alignment of contigs against a reference database. We show that this workflow, BugSplit, bins metagenome-assembled contigs to species with a 33% absolute improvement in F1-score when compared to alternative tools. We perform nanopore mNGS on patients with COVID-19 and, using a reference database predating COVID-19, demonstrate that BugSplit’s taxonomic binning enables sensitive and specific detection of a novel coronavirus that is not possible with other approaches. When applied to nanopore mNGS data from cases of Klebsiella pneumoniae and Neisseria gonorrhoeae infection, BugSplit’s taxonomic binning accurately separates pathogen sequences from those of the host and microbiota, and unlocks the possibility of sequence typing, in silico serotyping, and antimicrobial resistance prediction of each organism within a sample. BugSplit is available at https://bugseq.com/academic.
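
For reference, the F1-score cited in the abstract is the harmonic mean of precision and recall, written here in terms of true positives (TP), false positives (FP) and false negatives (FN). The abstract does not say how these counts are tallied (for example per contig or per base pair), so the definition below is the generic one. Reading "33% absolute improvement" as a difference of 0.33 in F1 (33 percentage points) rather than a relative change is an assumption based on the word "absolute".

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
\qquad
\mathrm{precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{recall} = \frac{TP}{TP + FN}
```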

Article activity feed

  1. SciScore for 10.1101/2021.10.16.464647

    Please note, not all rigor criteria are appropriate for all manuscripts.

    Table 1: Rigor

    Ethics: not detected.
    Sex as a biological variable: not detected.
    Randomization: As minimap2 randomly picks a primary alignment if there are multiple alignments with equal top score, we collapse equally good top hits to their lowest common ancestor.
    Blinding: not detected.
    Power Analysis: not detected.
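
The sentence flagged under Randomization describes collapsing equally good top hits to their lowest common ancestor (LCA). The sketch below illustrates that idea on a toy taxonomy. It is not BugSplit's implementation: the `PARENT` map, the node names, the `(taxon, score)` hit format and all function names are hypothetical, chosen only to make the example self-contained.

```python
from typing import Dict, List, Tuple

# Hypothetical toy taxonomy: child -> parent, with the root pointing to itself.
PARENT: Dict[str, str] = {
    "root": "root",
    "family": "root",
    "genus_a": "family",
    "genus_b": "family",
    "species_a1": "genus_a",
    "species_a2": "genus_a",
    "species_b1": "genus_b",
}


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from the root down to `taxon` (inclusive)."""
    path = [taxon]
    while parent[taxon] != taxon:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa: List[str], parent: Dict[str, str]) -> str:
    """Return the deepest node shared by the lineages of all `taxa`."""
    paths = [lineage(t, parent) for t in taxa]
    lca = paths[0][0]  # every lineage starts at the root
    for level in zip(*paths):
        if all(node == level[0] for node in level):
            lca = level[0]
        else:
            break
    return lca


def collapse_top_hits(hits: List[Tuple[str, float]], parent: Dict[str, str]) -> str:
    """Keep only the hits tied for the best score and return their LCA."""
    best = max(score for _, score in hits)
    top = [taxon for taxon, score in hits if score == best]
    return lowest_common_ancestor(top, parent)
```

With this toy data, two tied hits to `species_a1` and `species_a2` collapse to `genus_a`, whereas a single best hit is returned unchanged. The result then no longer depends on which of the tied alignments the aligner happened to mark as primary.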

    Table 2: Resources

    Software and Algorithms
    Sentences / Resources
    A mash database44, published by the mash authors and comprising all genomes and plasmid sequences in Refseq (https://gembox.cbcb.umd.edu/mash/refseq.genomes%2Bplasmid.k21s1000.msh) is used for homology search with Homopolish.
    Refseq
    suggested: (RefSeq, RRID:SCR_003496)
    In brief, plasmid sequences are identified with PlasmidFinder58, and their taxonomic identities are overridden to that of “plasmid sequences” (NCBI taxon 36549).
    PlasmidFinder58
    suggested: None
    MMseqs2 and DIAMOND were run with the NCBI non-redundant amino acid database as suggested by their authors.
    DIAMOND
    suggested: (DIAMOND, RRID:SCR_009457)
    These files were generated by converting the NCBI taxonomy files (names.dmp and nodes.dmp) provided with the CAMI datasets into Newick format with the Python taxonomy package59.
    Python
    suggested: (IPython, RRID:SCR_001658)
    Ground truths were generated by comparing each contig in our metagenomic assembly to the reference genome of each organism contained within the mock microbial community using MegaBLASTN.
    MegaBLASTN
    suggested: None
    The taxonomic identification of the top BLAST hit for each contig was determined to be its gold standard assignment.
    BLAST
    suggested: (BLASTX, RRID:SCR_001653)
    Binning completion and contamination were assessed with CheckM using the default CheckM database.
    CheckM
    suggested: (CheckM, RRID:SCR_016646)
    The NCBI nucleotide database from 2019 was downloaded from the second CAMI challenge (https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_blast/nt.gz) and used in place of BugSplit’s default database for the emerging coronavirus application.
    BugSplit’s
    suggested: None
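
One of the sentences in Table 2 mentions converting the NCBI taxonomy files (names.dmp and nodes.dmp) into Newick format. The authors used a Python taxonomy package for this; the sketch below does not use that package or reproduce its API. It is a plain-Python illustration of the conversion, assuming the standard nodes.dmp layout in which fields are separated by `|` and the first two fields are the taxon ID and its parent ID. The function names and sample taxon IDs are hypothetical, and the nodes are labelled with taxon IDs only (names.dmp is not read).

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_nodes(lines: Iterable[str]) -> Dict[str, str]:
    """Map each taxon ID to its parent ID from nodes.dmp-style lines."""
    parent = {}
    for line in lines:
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        parent[fields[0]] = fields[1]
    return parent


def to_newick(parent: Dict[str, str]) -> str:
    """Render the child -> parent map as a Newick string with internal labels."""
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if node == par:
            root = node  # the root is its own parent in nodes.dmp
        else:
            children[par].append(node)
    if root is None:
        raise ValueError("no root found (expected a node that is its own parent)")

    def render(node: str) -> str:
        kids = sorted(children.get(node, []))  # string sort, for stable output
        if not kids:
            return node
        return "(" + ",".join(render(kid) for kid in kids) + ")" + node

    return render(root) + ";"


# Toy input in nodes.dmp layout; the taxon IDs are illustrative only.
SAMPLE = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tsuperkingdom\t|\n",
    "10\t|\t2\t|\tgenus\t|\n",
    "11\t|\t10\t|\tspecies\t|\n",
    "12\t|\t10\t|\tspecies\t|\n",
]
NEWICK = to_newick(parse_nodes(SAMPLE))
```

The recursive `render` is adequate for a toy tree; an iterative traversal might be preferable for a very deep hierarchy.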

    Results from OddPub: Thank you for sharing your data.


    Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
    By incorporating graph topology and linkage of contigs, we will be able to mitigate this limitation and place the contig in multiple strain-level taxonomic bins. Further exploration of the parameter space of BugSplit may also result in improved binning. For example, minimap2 could be tuned for greater alignment recall while preserving precision than its default “map-ont” setting, and voting coverage thresholds may be able to be tuned for improved classification of contigs across the taxonomic hierarchy. Ultimately, we expect to adopt a strategy that will allow optimal values for key parameters to be determined by the taxonomic lineage of alignments. BugSplit is a highly accurate tool for taxonomic binning and profiling of third-generation metagenomic data with computing speeds faster than comparable workflows. We show that using BugSplit to bin metagenomic assemblies has several substantial downstream effects, including enabling highly similar species discrimination and identification, novel species identification and universal, pathogen-agnostic taxonomic profiling. When combined with automated assembly, polishing and post-processing of bins, we demonstrate that detecting pathogens, strain-typing them and accurately predicting their antimicrobial resistance directly from complex samples with mNGS becomes feasible.

    Results from TrialIdentifier: No clinical trial numbers were referenced.


    Results from Barzooka: We did not find any issues relating to the usage of bar graphs.


    Results from JetFighter: We did not find any issues relating to colormaps.


    Results from rtransparent:
    • Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
    • No funding statement was detected.
    • No protocol registration statement was detected.

    Results from scite Reference Check: We found no unreliable references.


    About SciScore

    SciScore is an automated tool that is designed to assist expert reviewers by finding and presenting formulaic information scattered throughout a paper in a standard, easy-to-digest format. SciScore checks for the presence and correctness of RRIDs (research resource identifiers), and for rigor criteria such as sex and investigator blinding. For details on the theoretical underpinning of rigor criteria and the tools shown here, including references cited, please follow this link.