RECODE - Relational Ecological COrpus for Data Extraction
Abstract
Ecology, conservation biology, and related disciplines are inherently data-based, with the success of many research projects and initiatives (e.g., protected areas, monitoring plans, etc.) being directly dependent on the availability of location and trait information on species and populations. Unfortunately, this data is often either nonexistent or available only as unstructured text within publications, especially for megadiverse taxa such as many invertebrate orders. With the emergence of Large Language Models, there have been many attempts, with variable success, to automatically parse such data into machine-readable formats, either using prompt engineering or by training fit-for-purpose models through named entity recognition (NER) and relation extraction. Model training has proven more efficient for complex data relations, but it needs labelled corpora, i.e., curated training data containing examples of this information for models to statistically learn from. Building such corpora is a time-consuming process and, to our knowledge, no standard datasets exist upon which to train the new and increasingly better models being released at an ever faster pace.
Here we describe RECODE, a manually annotated corpus of ecological and taxonomic literature, aimed at training and fine-tuning models for automated extraction of occurrence and trait data from unstructured text. All documents presented at this stage have been annotated and validated by experts familiar with the traits of the test taxa (spiders and insects).