Benchmarking the impact of reference genome selection on taxonomic profiling accuracy
Abstract
Background
Over the past decades, genome databases have expanded exponentially, often incorporating highly similar genomes at the same taxonomic level. This redundancy can hinder taxonomic classification: it makes closely related sequences harder to distinguish and increases computational demands. While some novel taxonomic classification tools address this redundancy by selecting a subset of genomes as references, it remains unclear how different reference genome selection methods affect results across taxonomic classification tools.
Results
We systematically evaluate genome selection and dereplication methods on bacterial and viral datasets using simulated metagenomic samples. We show that the impact of reference genome selection is strongly context-dependent. For bacterial profiling, incorporating all available genomes generally yields the highest accuracy, with only a limited impact on computational resource usage. In contrast, for highly redundant SARS-CoV-2 datasets, we find that stringent hierarchical clustering-based selection significantly improves the accuracy of lineage-level abundance estimation. Incorporating location-based metadata further enhances viral profiling performance by prioritizing locally relevant genomes. Across viral experiments, smaller reference sets significantly reduce memory and runtime requirements during both indexing and profiling, although the selection step itself adds a pre-processing cost.
Conclusions
Reference genome selection influences both accuracy and computational efficiency in taxonomic profiling, although its benefits are context-dependent. In diverse bacterial communities, comprehensive reference sets appear optimal, whereas in redundant viral datasets smaller, metadata-informed reference sets work best. These results demonstrate that reference set design has no one-size-fits-all solution, and that selection strategies should be adapted to the biological and computational setting.
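
To make the idea of hierarchical clustering-based reference selection concrete, the following is a minimal sketch and not the method evaluated in the study. It assumes that pairwise genome distances are already available, groups genomes with single-linkage clustering cut at a distance threshold, and keeps one medoid per cluster as the reference. The function name `select_representatives`, the genome identifiers, the distance values, and the threshold are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def select_representatives(names, dist, threshold):
    """Single-linkage clustering cut at `threshold`, then one medoid per cluster.

    names: list of genome identifiers (hypothetical).
    dist: dict mapping frozenset({a, b}) -> pairwise distance.
    threshold: pairs at or below this distance end up in the same cluster.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merging every pair within the threshold yields the same clusters as
    # cutting a single-linkage dendrogram at that height.
    for a, b in combinations(names, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)

    representatives = []
    for members in clusters.values():
        # Medoid: the member with the smallest total distance to the others
        # (ties broken by name so the result is deterministic).
        medoid = min(
            members,
            key=lambda m: (sum(dist[frozenset((m, o))] for o in members if o != m), m),
        )
        representatives.append(medoid)
    return sorted(representatives)


# Toy distances, invented for illustration (not data from the study).
names = ["g1", "g2", "g3", "g4"]
dist = {
    frozenset(("g1", "g2")): 0.001,
    frozenset(("g1", "g3")): 0.002,
    frozenset(("g2", "g3")): 0.004,
    frozenset(("g1", "g4")): 0.200,
    frozenset(("g2", "g4")): 0.210,
    frozenset(("g3", "g4")): 0.190,
}
selected = select_representatives(names, dist, threshold=0.005)
```

In this sketch, raising the threshold merges more genomes into each cluster and so yields a smaller reference set, while lowering it keeps more genomes. Other linkage criteria and other ways of picking a representative are equally plausible; single linkage and medoids are used here only because they are simple to express.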