Telomere-to-Telomere Assembly Improves Host Reads Removal in Metagenomic High-Throughput Sequencing of Human Samples


Abstract

Metagenomic high-throughput sequencing has revolutionized the study of the human microbiome, clinical pathogen detection and discovery, and infection diagnosis, but clinical samples often contain abundant human nucleic acids, leading to a high proportion of host reads. A high-quality human reference genome is essential for removing these host reads so that downstream analyses are faster and more accurate. The recently published complete human genome, the Telomere-to-Telomere CHM13 assembly (T2T), achieved immediate success but had yet to be tested for metagenomic sequencing. In this study, we demonstrate the improvement that T2T brings to the field, using a diverse set of samples comprising 4.97 billion reads sequenced from 165 libraries on short- and long-read platforms. To exclude the effect of the algorithm when comparing genomes, we first benchmarked the per-read performance of state-of-the-art algorithms. For short reads, bwa mem was the best-performing algorithm, with a positive median of differences (MD) and adjusted p-values <0.001 for all comparisons, whereas no consistent difference in overall performance was found among long-read algorithms. Compared with the current reference genomes hg38 and YH, T2T significantly improved per-read sensitivity in removing host reads for all sequencers (MD: 0.1443 to 0.7238 percentage points, all adjusted p-values <0.001), and the per-read Matthews correlation coefficient (MCC) with T2T was also higher (MD: 1.063 to 16.41 percentage points, all adjusted p-values <0.001). Reads exclusively mappable to T2T were concentrated mainly in the newly added regions of the assembly. Misclassified reads generally resulted from low-complexity sequences, contamination in reference genomes, and sequencing abnormalities. In downstream microbe-detection procedures, T2T did not affect true positive calls but greatly reduced false positive calls.
The improvement in the ability to correctly remove host reads suggests that T2T will serve as the next prevailing reference genome for metagenomic sequencing of samples containing human nucleic acids.

Article activity feed

  1. Materials and Methods

    I've always used bbduk.sh to remove host reads. I've used the database linked below, introduced in the seqanswers post below. One of the benefits of this method is that it separates FASTQ reads into different sets without requiring a BAM file, so it saves on hard disk space. The method below also uses a masked reference so as not to remove true microbial reads that have homology to the human genome. My two comments are:

    1. could you benchmark with bbduk as well? I think this could be a real contribution in this space.
    2. could you mask the T2T assembly like the bbduk author did? I've really appreciated not accidentally removing plant/fungal etc sequences from my metagenomes.

    Seqanswers introduction to method: http://seqanswers.com/forums/archive/index.php/t-42552.html
    Database link: https://drive.google.com/file/d/0B3llHR93L14wd0pSSnFULUlhcUk/edit?usp=sharing
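    The workflow the comment describes might look roughly like the command below. This is a hedged sketch, not the benchmark's actual pipeline: the input/output file names are placeholders, and the masked-reference file name is assumed to be the one distributed via the linked database; adjust k-mer length and memory settings to your data.

    ```shell
    # Sketch: split paired-end reads into host (outm) and non-host (out)
    # sets with BBDuk, matching against a masked human reference.
    # File names below are placeholders.
    bbduk.sh in=sample_R1.fq.gz in2=sample_R2.fq.gz \
        out=clean_R1.fq.gz out2=clean_R2.fq.gz \
        outm=host_R1.fq.gz outm2=host_R2.fq.gz \
        ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz \
        k=31 hdist=1 stats=host_removal_stats.txt
    ```

    Because BBDuk writes matched and unmatched reads directly to separate FASTQ files, no intermediate BAM is produced, which is the disk-space advantage the comment mentions.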

  2. Establishing the per-read gold standard

     By each method, a read was given a label indicating whether it was of host origin (for kraken2, a TaxID assignment to any Chordata species was considered a host label), but the true label of the read is unknown. We therefore compared all the labels given to each read to establish a de facto “ground truth”, or gold standard, by imposing the following criteria. A read was assigned a consensus label if all methods gave concordant results. When the results were discordant, the read was subjected to further examination. We used BLAST search results as the discriminating standard to resolve the discrepancies, because BLAST is an expensive yet very sensitive and more robust algorithm.26 Since the number of reads with discordant labels was too large for all of them to be aligned with BLAST, we narrowed down the discrepancies by allowing a more tolerant standard for assigning consensus labels: if at least one alignment-based method and at least one k-mer-based method labelled a read as host-derived, we assigned that read a host label as the gold standard. The remaining reads with discordant labels were queried against the NCBI nr/nt database with BLAST (blast+ v2.12.0). If a hit to Chordata sequences was found with sufficiently high alignment quality (identity 90 and coverage 90 for short reads; identity 70 for long reads), the read was considered truly of host origin, and otherwise non-host.
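    The labelling procedure above can be sketched as a small decision function. This is a minimal illustration, not the authors' code: the method names and their grouping into alignment-based versus k-mer-based sets are assumptions for the example.

    ```python
    # Hypothetical sketch of the consensus-labelling rule described above.
    # Method names and groupings are illustrative assumptions.
    ALIGNMENT_BASED = {"bwa mem", "bowtie2", "minimap2"}
    KMER_BASED = {"kraken2"}

    def consensus_label(labels):
        """labels: dict mapping method name -> True (host) / False (non-host).

        Returns "host", "non-host", or "blast" when the read must be
        adjudicated by a BLAST search against NCBI nr/nt.
        """
        calls = set(labels.values())
        if len(calls) == 1:  # all methods agree: consensus label
            return "host" if calls.pop() else "non-host"
        # Relaxed rule: at least one alignment-based AND at least one
        # k-mer-based method call the read host-derived.
        aln_host = any(v for m, v in labels.items() if m in ALIGNMENT_BASED)
        kmer_host = any(v for m, v in labels.items() if m in KMER_BASED)
        if aln_host and kmer_host:
            return "host"
        return "blast"  # remaining discordant reads go to BLAST
    ```

    For example, a read called host by bwa mem and kraken2 but not bowtie2 would still receive the host label, while a read called host only by bwa mem would be sent to BLAST for adjudication.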

    I think it might be worthwhile to create a simulated set of reads from the human genome and from some microbes, such as E. coli or others with high quality genomes where it can be confirmed that there is no human contamination therein. Especially for ribosomes and other sequences with homology, it could be very difficult to establish a ground truth. The inquiries you did with real data are super important as well, and well designed given that real data is messy, but I think simulated data would be a strong contribution here.
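    A simulated truth set of the kind suggested here could be built by sampling reads from references with known origin, so every read carries its ground-truth label. The sketch below uses error-free substring sampling from toy sequences purely for illustration; in practice one would sample from T2T-CHM13 and a curated microbial assembly and introduce realistic sequencing errors with a dedicated simulator such as wgsim or ART.

    ```python
    import random

    def simulate_reads(ref_seq, label, n_reads, read_len, seed=0):
        """Sample error-free fixed-length reads from ref_seq, keeping the
        ground-truth label ('host' or 'microbe') attached to each read."""
        rng = random.Random(seed)
        reads = []
        for _ in range(n_reads):
            start = rng.randrange(0, len(ref_seq) - read_len + 1)
            reads.append((ref_seq[start:start + read_len], label))
        return reads

    # Toy stand-ins for real references (placeholders, not real genomes).
    human_like = "ACGT" * 500
    ecoli_like = "TTGACA" * 400
    truth_set = (simulate_reads(human_like, "host", 100, 150)
                 + simulate_reads(ecoli_like, "microbe", 100, 150))
    ```

    Running the benchmarked host-removal pipelines on such a truth set would give exact per-read sensitivity and specificity without relying on a consensus-derived gold standard.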