Application of Computer Vision to the Automated Extraction of Metadata from Natural History Specimen Labels: A Case Study on Herbarium Specimens
Abstract
Metadata extraction from the labels of natural history collections is a pivotal task for the online publication of digitized specimens. However, it is extremely time-consuming, given the estimated number of specimens in natural history collections (more than 2 billion worldwide, of which ca. 400 million are herbarium specimens). Automated data extraction from digital images of specimens and their labels is therefore an application where state-of-the-art computer vision techniques could be applied successfully. Extracting information from herbarium specimen labels consists of three steps: text segmentation, multilingual/handwriting recognition, and data parsing. The principal bottleneck in this process is the limited performance of Optical Character Recognition (OCR). This study explores how to transfer the general knowledge present in multimodal Transformers to the specific sub-task of herbarium specimen label digitization. This would result in an easy-to-use, end-to-end solution that aims to remove the bottleneck of classic OCR systems while allowing greater flexibility in adapting to different label formats. Donut-base, a pre-trained visual document understanding (VDU) transformer, was selected as the base model for fine-tuning. A dataset from the University of Pisa was used as a test bed. The initial attempt achieved an accuracy of 85%, computed from the Tree Edit Distance (TED), demonstrating that fine-tuning is a feasible solution. Cases with low accuracy were also investigated to highlight flaws in the approach. Specimens with more than one label, especially those mixing different handwriting with typewritten information, are the most difficult to deal with, and approaches for targeting these weaknesses are discussed.
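The abstract does not spell out how the TED-based accuracy is normalized. As a minimal sketch, one common formulation converts the predicted and ground-truth records into trees and scores them as max(0, 1 - TED(pred, gold) / TED(empty, gold)). The code below assumes unit costs for inserting, deleting, and renaming nodes; published evaluators may weight value edits by string similarity instead. The field names (`taxon`, `collector`, `date`, `locality`) are hypothetical and not taken from the study's dataset.

```python
from functools import lru_cache


def to_tree(obj, label="<root>"):
    """Convert a nested dict/list/scalar into a hashable (label, children) tree."""
    if isinstance(obj, dict):
        # Keys are sorted so that field order does not affect the score.
        return (label, tuple(to_tree(v, k) for k, v in sorted(obj.items())))
    if isinstance(obj, list):
        return (label, tuple(to_tree(v, "<item>") for v in obj))
    # Scalars become a single leaf under their key node.
    return (label, ((str(obj), ()),))


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Ordered forest edit distance with unit costs (simple memoized recursion)."""
    if not f1 and not f2:
        return 0
    if not f1:
        _, kids = f2[-1]
        return forest_distance(f1, f2[:-1] + kids) + 1  # insert a node
    if not f2:
        _, kids = f1[-1]
        return forest_distance(f1[:-1] + kids, f2) + 1  # delete a node
    (l1, k1), (l2, k2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + k1, f2) + 1,           # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + k2) + 1,           # insert rightmost root of f2
        forest_distance(f1[:-1], f2[:-1])                # match the two rightmost trees
        + forest_distance(k1, k2)
        + (l1 != l2),                                    # rename cost if labels differ
    )


def ted_accuracy(pred, gold):
    """Normalized TED accuracy in [0, 1]; 1.0 means the trees are identical."""
    gold_tree, pred_tree = to_tree(gold), to_tree(pred)
    baseline = forest_distance((to_tree({}),), (gold_tree,))
    if baseline == 0:
        return 1.0 if pred_tree == gold_tree else 0.0
    dist = forest_distance((pred_tree,), (gold_tree,))
    return max(0.0, 1.0 - dist / baseline)


# Hypothetical label record: one misspelled value out of four fields.
gold = {"taxon": "Quercus ilex", "collector": "Rossi", "date": "1890", "locality": "Pisa"}
pred = {"taxon": "Quercus ilex", "collector": "Rosi", "date": "1890", "locality": "Pisa"}
print(ted_accuracy(pred, gold))  # 0.875
```

With unit costs, a single wrong value costs one rename against a baseline of eight insertions (four key nodes and four value leaves), giving 1 - 1/8 = 0.875. The memoized recursion is adequate for small label records; larger trees would call for a dedicated Zhang-Shasha implementation.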