A Bioinformatic Pipeline for Consensus Taxonomic Classification of Long-Read Amplicons
Abstract
Characterizing community composition is fundamental to understanding microbial community function. Recent advances in Oxford Nanopore Technologies (ONT) long-read sequencing now allow community profiling using full-length gene amplicons, affording better taxonomic resolution than standard short-amplicon Illumina sequencing. However, robust ONT-compatible profiling workflows are lacking. To address this, we have created the Amplicon Consensus Taxonomy (ACT) pipeline for classifying long-read amplicons. ACT combines output from three existing pipelines – Emu, Sintax, and LACA – to leverage the strengths of each while offsetting their individual limitations. We also developed the ACT database (ACT-DB), a sequence-similarity-aware reference database that clusters highly similar sequences into multi-taxa groups to reduce overclassification. We benchmarked ACT performance against Emu and Sintax using a defined simple mock community, simulated datasets, and a complex rhizosphere community supplemented with novel species. While ACT exhibited generally comparable or superior performance across datasets, it demonstrated a marked advantage over Emu and Sintax in identifying novel and low-abundance taxa in both simple and complex communities, resulting in significantly higher species-richness estimates that better reflected those observed in prior Illumina amplicon studies. Furthermore, by clustering ambiguous reference sequences, ACT-DB allowed ACT to resolve reads to meaningful multi-species groups, improving resolution without coercing artificial precision. Together, ACT and ACT-DB form a robust long-read amplicon profiling workflow that confidently identifies known species while reducing overclassification and preserving low-abundance and unknown taxa.
IMPORTANCE
Microbial communities are frequently characterized by amplicon sequencing of marker genes, such as the bacterial 16S rRNA gene and fungal ITS region. Historically, the standard profiling method has been Illumina sequencing of 200-300 bp amplicons, but improved accuracy of ONT long-read sequencing means it is now possible to sequence amplicons spanning full genes of any size, prompting the need for tools optimized for long amplicons. Here, we describe the ACT bioinformatic pipeline for assigning taxonomy to amplicons of any length. Using full-length 16S amplicon data, we evaluated ACT performance against that of two commonly used pipelines. Additionally, we developed a sequence ambiguity-aware ACT database (ACT-DB) of 16S rRNA sequences to further improve classification accuracy and resolution.
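To make the idea of multi-taxa groups concrete, the following is a minimal sketch, not the ACT-DB implementation. It assumes reference sequences have already been assigned to similarity clusters by some upstream step that the text above does not specify. All names here (`build_group_labels`, `label_read`, the `ref*` and `c*` identifiers, and the placeholder species names) are hypothetical and chosen only for illustration. The sketch shows how a read whose best hit falls in a cluster of near-identical references could be reported as a multi-species group rather than forced to a single species.

```python
from collections import defaultdict


def build_group_labels(ref_to_cluster, ref_to_species):
    """Label each cluster with the sorted set of species it contains.

    Assumption: cluster membership is precomputed; how ACT-DB derives
    clusters is not described in the text above.
    """
    members = defaultdict(set)
    for ref_id, cluster_id in ref_to_cluster.items():
        members[cluster_id].add(ref_to_species[ref_id])
    # A cluster with one species keeps that name; a cluster with several
    # species gets a joined multi-taxa label.
    return {cid: "/".join(sorted(species)) for cid, species in members.items()}


def label_read(best_hit_ref, ref_to_cluster, group_labels):
    """Report a read at the group level of its best-hit reference."""
    return group_labels[ref_to_cluster[best_hit_ref]]


# Toy, made-up data: ref1 and ref2 are treated as too similar to separate.
ref_to_cluster = {"ref1": "c1", "ref2": "c1", "ref3": "c2"}
ref_to_species = {
    "ref1": "Genus_a species_1",
    "ref2": "Genus_a species_2",
    "ref3": "Genus_b species_3",
}

group_labels = build_group_labels(ref_to_cluster, ref_to_species)
ambiguous_call = label_read("ref1", ref_to_cluster, group_labels)
unique_call = label_read("ref3", ref_to_cluster, group_labels)
```

In this toy case, a read hitting `ref1` or `ref2` receives the same two-species group label, while a read hitting `ref3` keeps a single-species label. The point of the design is that the label never claims more precision than the reference sequences can support.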