Sputum Respiratory Pathogen Genomic Surveillance: A Practical Approach for Long-Read Metagenomic Sequencing

Abstract

Severe acute respiratory infections (SARI) remain a major global health concern, particularly in resource-limited settings where comprehensive pathogen detection is challenging. Conventional diagnostics, including culture and multiplex PCR, are restricted to predefined targets and may miss emerging or uncommon pathogens. This study aimed to develop and optimize a practical long-read metagenomic next-generation sequencing (mNGS) workflow using Oxford Nanopore Technology for sputum samples from adults hospitalized with SARI. We evaluated sputum liquefaction, host nucleic acid depletion strategies, and SMART-9N-based cDNA amplification to enhance microbial and viral nucleic acid recovery. We found that dithiothreitol (DTT) treatment significantly improved nucleic acid extraction. DNase I treatment effectively reduced host background while preserving both viral and bacterial sequences, outperforming filtration-based approaches that reduced viral recovery. The optimized workflow enabled unbiased detection and strain-level resolution of respiratory pathogens, including rhinovirus C1/C42, human coronavirus HKU1, Mycoplasma pneumoniae, Haemophilus parainfluenzae, and Pseudomonas aeruginosa, with high genome coverage. This approach demonstrated robust performance for respiratory pathogen identification directly from sputum samples. The proposed workflow is scalable and suitable for clinical diagnostics and public health surveillance. Further validation in larger cohorts is warranted to assess diagnostic sensitivity, accuracy, and feasibility for routine implementation.

This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/20417815.

Major Issues

The sample size is very small for the strength of the claims. The pretreatment comparison appears to rely on two sputum samples, and the optimized workflow is reported across six cases. This is appropriate for a pilot optimization study, but not enough to support claims of "robust performance," "diagnostic accuracy," or routine clinical/surveillance readiness. The authors should frame the study as a proof-of-concept and substantially soften claims about scalability and clinical implementation.
Diagnostic validation is insufficient. The manuscript reports pathogen detection by mNGS, but it is unclear how each detected organism was confirmed. Orthogonal validation by multiplex PCR, targeted PCR, culture, qPCR, or reference sequencing is needed, especially for strain-level calls and bacterial detections from sputum. Without this, it is difficult to distinguish true infection, colonization, contamination, or database misclassification.
The study lacks sensitivity, specificity, and limit-of-detection assessment. For a diagnostic or surveillance workflow, the authors should include analytical sensitivity using spiked controls or dilution series, specificity using negative sputum controls, reproducibility across replicates, and comparison to standard clinical testing. The current data show feasibility, but not diagnostic performance.
"Unbiased detection" should be qualified. The workflow includes DNase treatment, SMART-9N amplification, 30 PCR cycles, host filtering, assembly, and database-based classification. Each step introduces bias. SMART-9N amplification may distort relative abundance and genome coverage, while DNase and filtration can differentially affect bacteria, DNA viruses, RNA viruses, and free nucleic acid. The authors should describe the workflow as broad-range rather than fully unbiased.
Clinical interpretation of sputum organisms needs caution. Sputum contains oral flora and colonizing organisms. Detections such as Haemophilus parainfluenzae, Pseudomonas spp., and Metamycoplasma salivarium may not necessarily represent causative pathogens. The manuscript should include clinical metadata, comparator diagnostic results, bacterial load thresholds, or criteria for interpreting pathogen relevance.
Strain-level claims need stronger evidence. The manuscript states strain/genotype-level resolution for all six cases, including rhinovirus C1/C42 and human rhinovirus NAT001. The authors should specify the classification thresholds, genome breadth, sequence identity, coverage uniformity, reference database versions, and whether assemblies were phylogenetically validated. Kraken2 classification alone is usually not sufficient for confident strain-level reporting.
Data and code availability are missing or unclear. I did not see a clear data availability statement for raw reads, assemblies, reference databases, or scripts. For a methods paper, reproducibility is central. The authors should deposit sequencing data, provide accession numbers, share command-line parameters, and include processed tables for host read fraction, microbial read counts, coverage, and detected taxa.
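To illustrate the comparator analysis requested above, the following is a minimal Python sketch of how mNGS calls could be summarised against a reference test. It is not the authors' method. The function names and the input format (one pair of booleans per sample-target combination) are assumptions made for illustration, and the Wilson score interval is one common choice of confidence interval rather than the only appropriate one.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if total == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def diagnostic_performance(pairs):
    """Summarise mNGS calls against a comparator test.

    pairs: iterable of (mngs_positive, comparator_positive) booleans,
    one per sample-target combination (hypothetical input format).
    """
    tp = fp = fn = tn = 0
    for mngs, comparator in pairs:
        if mngs and comparator:
            tp += 1  # detected by both
        elif mngs and not comparator:
            fp += 1  # mNGS only
        elif comparator:
            fn += 1  # comparator only
        else:
            tn += 1  # negative by both
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "specificity_ci": wilson_interval(tn, tn + fp),
    }
```

Under these assumptions, six concordant positives out of six give a point estimate of 1.0 for sensitivity, but the lower bound of the 95% Wilson interval is only about 0.61, and specificity cannot be estimated at all without comparator-negative samples. This is the main reason the current sample size cannot support claims about diagnostic accuracy.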

Minor Issues

Please report the exact number of samples used in each experiment more clearly in the Results and Methods.
The Results use language such as "significantly improved," but no statistical testing is presented. Use descriptive language or provide statistical analysis.
Clarify whether replicates were technical aliquots, independent extractions, sequencing replicates, or separate patients.
Figure 2 labels are small and difficult to read. Enlarging labels and adding a table of read proportions would help.
Figure 4 uses different y-axis scales across pathogens. This is acceptable, but the caption should emphasize that visual comparison of depth between panels is not direct.
Add genome breadth/percent coverage alongside depth. High depth in some regions does not necessarily mean near-complete recovery.
Clarify whether DNase I was applied before or after extraction for each workflow, and how RNA viruses, DNA viruses, and bacterial cells are expected to be affected.
The methods say "two host depletion strategies" but list untreated control, DNase I, filtration plus DNase I, and adaptive sampling. Reword for clarity.
The 72-hour sequencing duration may limit rapid diagnostic use. Please report when actionable pathogen calls became available during sequencing.

Consider consistent taxonomy and clinical naming for Mycoplasma/Mycoplasmoides pneumoniae.
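To make the distinction between depth and breadth concrete, here is a short Python sketch of how both could be reported per reference genome. It assumes a three-column, tab-separated per-position depth table (reference name, position, depth) that includes zero-depth positions, such as the output of `samtools depth -a`. The function names and the `min_depth` threshold are hypothetical and would need to be chosen and justified by the authors.

```python
def parse_depth_lines(lines):
    """Group per-position depths by reference name.

    lines: iterable of 'ref<TAB>pos<TAB>depth' strings (assumed format).
    """
    per_ref = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ref, _pos, depth = line.split("\t")
        per_ref.setdefault(ref, []).append(int(depth))
    return per_ref


def coverage_summary(depths, min_depth=1):
    """Return breadth (fraction of positions at or above min_depth) and mean depth."""
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return {"breadth": covered / n, "mean_depth": sum(depths) / n}


if __name__ == "__main__":
    # Hypothetical genome: deep coverage over 3 of 10 positions, none elsewhere.
    print(coverage_summary([100] * 3 + [0] * 7))
```

In the hypothetical example, the mean depth is 30x while the breadth is only 0.3, so reporting depth alone would overstate how much of the genome was recovered.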

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they did not use generative AI to come up with new ideas for their review.
