Discovering the unseen: a performance comparison of taxonomic classification methods under unknown DNA barcodes
Abstract
DNA barcoding and metabarcoding have emerged as cost-efficient, standardized methods for characterizing local biodiversity. By sequencing a short, targeted gene fragment and comparing it against reference sequence databases, it is theoretically possible to identify a wide diversity of taxa. However, a key challenge for accurate taxonomic classification is the incompleteness of such databases, which leaves most query sequences without species-level matches.
Where species-level matches are missing, it may be possible to classify query sequences to a higher taxonomic level, such as genus or family, based on their similarity to related reference taxa. The challenge then lies in confidently recognizing whether the sequence belongs to an unobserved (here, “novel”) taxon at a given taxonomic level.
In this study, we evaluated the performance and utility of several methods for taxonomic classification. Methods were assessed based on the classification accuracy of both observed and novel taxa, training time, space requirements, and run time. We did this for two cases: the COI barcode for arthropods, and the ITS barcode for fungi, with the latter representing an instance with substantially greater variation within classes. To test classification of novel taxa, we used well-curated datasets with partially distinct taxonomic distributions. Novel taxa were present at multiple taxonomic levels, including genera, families, and orders. We further assessed the effect on performance of shifting from full-length barcodes to shorter sequences, such as those generated through metabarcoding, in the testing dataset.
This study sheds light on the strengths and limitations of different classification algorithms across varied ecological contexts and provides guidance for researchers in selecting suitable algorithms for DNA barcoding and metabarcoding applications. In particular, it demonstrates the superior performance of phylogenetic placement methods such as EPA-ng for classification of COI barcodes, and of composition-based classifiers such as SINTAX, RDP, and IDTAXA for ITS.
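The idea of falling back to a higher taxonomic level when a query is novel at a lower one can be illustrated with a minimal sketch. It assumes a classifier has already produced a confidence score for its best candidate at each rank; the function name `classify_with_novelty`, the input layout, and the 0.8 threshold are hypothetical choices for illustration and do not reproduce the method of any tool evaluated in the study.

```python
from typing import Dict, List, Optional, Tuple

# One entry per rank, ordered from the highest rank to the lowest:
# (rank name, best candidate taxon, confidence in [0, 1]).
Prediction = Tuple[str, str, float]


def classify_with_novelty(
    predictions: List[Prediction],
    threshold: float = 0.8,  # illustrative cutoff, not taken from the study
) -> Tuple[Dict[str, str], Optional[str]]:
    """Accept assignments rank by rank until confidence drops below the threshold.

    Returns the accepted assignments and the first rank at which the query is
    treated as novel (None if every rank was accepted).
    """
    assigned: Dict[str, str] = {}
    for rank, taxon, confidence in predictions:
        if confidence < threshold:
            # Stop at the first low-confidence rank; lower ranks are not assigned.
            return assigned, rank
        assigned[rank] = taxon
    return assigned, None


# Hypothetical scores for a single query sequence.
query = [
    ("order", "Lepidoptera", 0.99),
    ("family", "Noctuidae", 0.93),
    ("genus", "Agrotis", 0.41),
    ("species", "Agrotis ipsilon", 0.12),
]

assigned, novel_from = classify_with_novelty(query)
print(assigned, novel_from)
```

In this example the query is assigned to order and family and flagged as a putative novel genus. The threshold controls the trade-off between over-classifying novel taxa and under-classifying observed ones, and in practice it would need to be tuned per method and per barcode.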