Automating the Curation of DNA Barcode Databases for Vascular Plants
Abstract
Comprehensive, curated, and current DNA barcode reference databases are essential both for the identification of single specimens and for the interpretation of metabarcoding data. For plants, a nuclear marker, the internal transcribed spacer (ITS) region, and plastid markers (rbcL, matK) are commonly used in combination. Because the plastid regions are segments of protein-coding genes, their alignment and analysis are usually straightforward. By contrast, the assembly and validation of ITS records are considerably more difficult for two reasons: the prevalence of indels and the presence of intraindividual variation. This complexity has prompted the development of several workflows to support the curation of ITS reference databases for plant barcoding. However, the pipelines used to create these databases lack functionality that is essential for robust post-analytical validation. This paper presents a new workflow that addresses these shortcomings, with the goal of enhancing the reliability and accuracy of plant barcoding studies. We furthermore demonstrate that clustering reference databases substantially reduces the fraction of queries that receive a correct species-level assignment. By contrast, setting an acceptance threshold for identifications, based on the distance between query and match, meaningfully reduces error rates when reference databases are incomplete.
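The idea of an acceptance threshold on the query-to-match distance can be illustrated with a minimal sketch. This is not the workflow presented in the paper. It assumes that query and reference sequences are already aligned to equal length, uses a simple p-distance that skips gap and N positions (a simplification for an indel-rich marker such as ITS), and uses a placeholder threshold of 0.02 that is not taken from the study. The function names `p_distance` and `assign_species` are hypothetical.

```python
from typing import Dict, Optional, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two pre-aligned sequences.

    Positions where either sequence has a gap ('-') or 'N' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def assign_species(
    query: str,
    references: Dict[str, str],
    max_distance: float = 0.02,  # placeholder value, not from the paper
) -> Tuple[Optional[str], Optional[float]]:
    """Return (species, distance) for the closest reference.

    The species is None when the best match is farther away than
    max_distance, i.e. the identification is rejected rather than
    forced onto the nearest record.
    """
    best_name: Optional[str] = None
    best_dist: Optional[float] = None
    for name, seq in references.items():
        d = p_distance(query, seq)
        if best_dist is None or d < best_dist:
            best_name, best_dist = name, d
    if best_dist is None or best_dist > max_distance:
        return None, best_dist
    return best_name, best_dist


if __name__ == "__main__":
    refs = {"Species A": "ACGTACGTAC", "Species B": "ACGTTCGTAA"}
    print(assign_species("ACGTACGTAC", refs))  # exact match to Species A
    print(assign_species("TTTTACGTAC", refs))  # best match too distant
```

With an incomplete reference database, the second query would otherwise be assigned to its nearest neighbour even though the true species may be missing; the threshold turns that case into an explicit non-assignment. Ties between equally distant references are resolved here by insertion order, which a real pipeline would need to handle explicitly.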