Nanopore long-read only genome assembly of clinical Enterobacterales isolates is complete and accurate
Abstract
With recent improvements in Nanopore sequencing accuracy, whole bacterial genome sequence reconstruction using Oxford Nanopore Technologies (“Nanopore”) long-read only sequencing may offer a lower-cost, higher-throughput alternative to ‘hybrid’ assembly for pathogen surveillance. We evaluated the accuracy, including plasmid reconstruction, of Nanopore long-read only genome assemblies of Enterobacterales.
We sequenced 92 genomes from clinical Enterobacterales isolates, collected in England under a national surveillance program, with long-read Nanopore (R10.4.1, Dorado v5.0.0 super-high-accuracy basecalled) and short-read Illumina (NovaSeq) sequencing approaches. Genomes were assembled using three long-read only assemblers (Flye; Hybracter long; Autocycler) and three hybrid assemblers (Hybracter hybrid; Unicycler normal; Unicycler bold). Three polishing modalities (Medaka v2 with subsampled or un-subsampled long-reads; Polypolish + Pypolca with short-reads) were investigated.
Autocycler circularised the most chromosomes (87/92 [95%]). Plasmid sequence reconstruction was comparable between all assemblers except Flye, all recovering 90-96% of plasmids, although the ‘ground truth’ was uncertain. Flye performed worse than other assemblers on almost all metrics. Autocycler + Medaka (un-subsampled long-reads) was the most accurate long-read only assembler/polisher combination, comparable to hybrid assemblies (median 0 [IQR:0-0] SNPs and 0 [IQR:0-1] indels per genome; quality value/Q score, 100 [IQR: 64-100]), with only 4/92 genome sequences having >10 SNPs/indels. Medaka polishing with un-subsampled long-reads resulted in small improvements in indels but not SNPs for both Flye and Autocycler assemblies. Seven-locus MLST, antimicrobial resistance, virulence, and stress gene annotation was equivalent across assembler/polisher combinations.
Nanopore long-read only bacterial genome assembly with Autocycler combined with Medaka polishing (using un-subsampled reads) is similarly accurate and possibly more complete than hybrid assemblies, representing a viable alternative for incorporating high-quality genomic data, including plasmids, into Enterobacterales surveillance.
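For readers unfamiliar with the quality value (Q score) reported above, the sketch below shows how a Phred-scaled value could be derived from per-genome error counts. It is illustrative only: it assumes the conventional definition Q = -10·log10(errors / genome length) and assumes that error-free assemblies are assigned a ceiling of 100, which the abstract does not state explicitly. The function name `phred_qv` and the 5 Mb genome length are hypothetical and are not taken from the study's pipeline.

```python
import math


def phred_qv(errors: int, genome_length: int, cap: float = 100.0) -> float:
    """Return a Phred-scaled quality value for an assembly.

    errors        -- total SNPs + indels relative to the reference (assumed input)
    genome_length -- assembly length in bases
    cap           -- ceiling applied when the error rate is zero or very small
                     (assumption: 100, matching the maximum reported above)
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if errors < 0:
        raise ValueError("errors must be non-negative")
    if errors == 0:
        # log10(0) is undefined, so an error-free assembly gets the cap.
        return cap
    return min(cap, -10.0 * math.log10(errors / genome_length))


if __name__ == "__main__":
    # Hypothetical 5 Mb genome with 0, 2 and 10 errors.
    for n_errors in (0, 2, 10):
        print(n_errors, round(phred_qv(n_errors, 5_000_000), 2))
```

Under these assumptions, two errors in a 5 Mb genome give a Q score of about 64, and an error-free genome takes the capped value of 100.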
Data Summary
Nanopore long-reads and Illumina short-reads from the 92 Enterobacterales isolates in this study have been uploaded to ENA (BioProject accession: PRJEB93885). Code for the Nextflow assembly pipeline, downstream analysis scripts, and R statistical analysis scripts is available on GitHub (https://github.com/oxfordmmm/NEKSUS_ont_hybrid_assembly_comparison). The following supplementary data tables are available on FigShare (https://figshare.com/account/home#/projects/253775):
- ENA sample accessions and sample metadata (accessions_and_metadata.csv)
- Seqkit stats summaries of the Illumina and Nanopore reads (raw_qc_sup.csv)
- Summary of assembly contig features (contigs_summary_sup_cleaned.csv)
- Pairwise mash distances between contigs (mash_cleaned.csv)
- Plasmids matching across different assemblers compared to the Hybracter (hybrid) and manually curated reference sets (plasmids_match_hybracter_mash.csv and plasmids_match_manual_mash.csv, respectively)
- Seven-locus multi-locus sequence type annotation (mlst_cleaned.csv)
- CheckM2 summaries of assemblies (checkm2_cleaned.csv)
- Nucleotide-level accuracy of assemblies (SNPs, indels, and quality value compared to short-read mapping; assembly_nucleotide_accuracy_cleaned.csv)
- Bakta annotation (bakta_by_contig_cleaned.csv)
- AMRFinderPlus annotations of contigs (amrfinder_plus_cleaned.csv)
- MOB-suite annotation summaries of contigs (mobsuite_cleaned.csv)
Impact Statement
Nanopore long-reads have historically been too error-prone to use alone for accurate bacterial genome assembly, necessitating additional Illumina short-reads to achieve structurally complete and accurate ‘hybrid’ genome assemblies for public health surveillance. This increases cost and complexity. Previous studies have shown that recent improvements in Nanopore chemistry (R10.4.1 flowcell) and basecalling (super-high accuracy) allow high-quality long-read only assemblies on a small number of laboratory reference strains. This is the first evaluation, to our knowledge, to assess Nanopore long-read only genome assembly compared with hybrid assembly on a large number of clinical isolates. In addition, this is the first large-scale evaluation of the recently released automated consensus long-read assembly tool, Autocycler.
We show that Autocycler long-read only assemblies are more structurally complete for chromosomal sequences, while reconstructing a similar number of plasmids to other long-read and hybrid assemblers. Most long-read polished, Autocycler-assembled genome sequences have 0 errors (median: 0 SNPs/indels) relative to short-read polished (hybrid) Autocycler assemblies, enabling accurate annotation of key genes.