Curating 16S rRNA databases enhances taxonomic accuracy and computational efficiency in microbial profiling


Abstract

The 16S rRNA gene serves as the gold standard molecular marker for microbial profiling, yet the accuracy of taxonomic assignment depends critically on the quality of the reference database. Databases differ substantially in sequence coverage, curation standards, and taxonomic nomenclature, which leads to conflicting taxonomic assignments. Although previous comparisons have highlighted performance differences, the impact of database preprocessing, including sequence cleaning and redundancy removal, on taxonomic classification remains understudied. To improve database quality, we implemented novel cleaning approaches that remove nested and duplicate sequences and correct missing taxonomic nomenclature. We compared four major 16S databases (SILVA, Greengenes2, RefSeq, and MIMt) for genus-level microbial profiling, using 69 mock communities and the DADA2 analysis pipeline. Database size decreased significantly after cleaning: SILVA from 452,055 to 291,733 sequences, Greengenes2 from 337,506 to 277,982 sequences, MIMt from 48,749 to 34,734 sequences, and RefSeq from 27,376 to 25,970 sequences. Greengenes2, MIMt, and RefSeq performed comparably to one another and consistently outperformed SILVA in recall, precision, and abundance estimation accuracy across all sample types. Cleaning SILVA improved computational efficiency by up to 50% while maintaining classification performance. We provide a benchmarking framework, together with the cleaned databases, as resources for accurate 16S rRNA analysis and profiling.
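
As an illustration of the redundancy removal described above, the following is a minimal sketch, not the authors' implementation. It assumes that a "nested" sequence is one that occurs as an exact substring of a longer reference sequence and that a "duplicate" is an exact match; the function name `remove_duplicates_and_nested` and the toy records are hypothetical. The abstract does not state whether taxonomy labels are considered when deciding what to drop, so this sketch ignores them.

```python
def remove_duplicates_and_nested(records):
    """Drop exact duplicates and sequences nested inside a longer one.

    records: dict mapping sequence ID -> nucleotide sequence (str).
    Returns a dict with the retained IDs and their upper-cased sequences.

    Assumption (not from the source): "nested" means an exact substring
    of another sequence in the same database, regardless of taxonomy.
    """
    # Process longest sequences first, so any sequence that could contain
    # another has already been kept before the shorter one is examined.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    kept = {}
    for seq_id, seq in ordered:
        seq = seq.upper()
        # An exact duplicate is also a substring, so one check covers both.
        if any(seq in kept_seq for kept_seq in kept.values()):
            continue
        kept[seq_id] = seq
    return kept


# Toy example with hypothetical IDs and sequences.
toy = {
    "ref_a": "ACGTACGT",
    "ref_b": "ACGTACGT",  # exact duplicate of ref_a
    "ref_c": "GTAC",      # nested inside ref_a
    "ref_d": "TTTT",      # unrelated, retained
}
cleaned = remove_duplicates_and_nested(toy)
print(sorted(cleaned))
```

This pairwise scan is quadratic in the number of sequences, which is fine for a toy example but would need an indexed or clustering-based approach for databases with hundreds of thousands of entries.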
