Performance and limitations of out-of-distribution detection for insect DNA (meta)barcoding

Abstract

Successful applications of DNA barcoding/metabarcoding rely on the accurate taxonomic identification of sequence fragments. When biological surveys with DNA (meta)barcoding target underexplored biological communities, sequence-based identification is often conducted using incomplete databases that do not fully cover the regional species pool. Consequently, specimens to be identified may include species not present in reference databases. Such unknown or “out-of-distribution” samples can cause misidentification if left undetected. A similarity cutoff is commonly used to detect out-of-distribution samples before taxonomic assignment, but its effectiveness has not been carefully studied. In this study, we evaluated the performance of out-of-distribution detection for DNA barcoding with genetic distance and deep learning metrics. Using extensively sampled datasets of multiple insect taxa, we measured the performance of identification and out-of-distribution detection under conditions in which genetic variations in species were sufficiently sampled. Although identification with DNA barcoding is a highly accurate process, even with short noisy fragments, out-of-distribution detection was more susceptible to a reduction in performance due to sequence noise and a lack of diagnosable characters. Our results provide guidelines for designing unknown-proof identification procedures by determining factors affecting out-of-distribution detection performance.
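The similarity cutoff mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `p_distance` and `identify_with_cutoff`, the 0.97 default cutoff, and the toy sequences are all assumptions made for illustration, and the sketch covers only a simple genetic-distance metric, not the deep learning metrics the study also evaluates. A query is assigned to the label of its most similar reference sequence, unless that similarity falls below the cutoff, in which case it is flagged as out-of-distribution.

```python
from typing import List, Optional, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or ambiguous base ('N')
    are skipped. Assumes the sequences are already aligned.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def identify_with_cutoff(
    query: str,
    references: List[Tuple[str, str]],
    cutoff: float = 0.97,  # hypothetical threshold, not taken from the study
) -> Tuple[Optional[str], float]:
    """Return (label, similarity) of the best-matching reference.

    The label is None when the best similarity is below `cutoff`,
    i.e. the query is treated as out-of-distribution.
    """
    if not references:
        raise ValueError("reference set is empty")
    best_label: Optional[str] = None
    best_sim = -1.0
    for label, ref_seq in references:
        sim = 1.0 - p_distance(query, ref_seq)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < cutoff:
        return None, best_sim
    return best_label, best_sim


# Toy reference set (made-up sequences and labels).
refs = [("species_A", "ACGTACGTAC"), ("species_B", "TTGTACGTAA")]

known = identify_with_cutoff("ACGTACGTAC", refs)
unknown = identify_with_cutoff("ACGTACGTAA", refs)
```

In this toy case the first query matches `species_A` exactly and is assigned to it, while the second differs from its nearest reference at one site in ten and is rejected at the assumed cutoff. The abstract's finding that out-of-distribution detection degrades with sequence noise corresponds, in this sketch, to noise shifting similarities across the cutoff in either direction.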
