Accelerating metadata annotation in collaborative research centers: A hybrid AI workflow for biomedical entities
Abstract
Background
Collaborative Research Centers rely on FAIR-compliant, richly structured metadata, yet manual annotation is a major bottleneck. We implemented an AI- and search-augmented large language model (LLM) workflow within a local research data management system to pre-annotate biomedical entities, with human-in-the-loop verification to ensure data quality.

Methods
The pipeline uses Gemini 3.0 Pro in a two-step prompting strategy: (1) identify dataset deposits and stable identifiers in articles converted to Markdown; (2) extract structured fields from curated repository landing pages rendered via a headless browser. To respect a highly hierarchical metadata schema, we flattened the schema for prompting and remapped outputs to strict JSON with granular provenance tags. Authors received pre-filled metadata and could accept, edit, or delete each entry; these actions were mapped to true positives (TP), false positives (FP), and false negatives (FN). Performance metrics (precision, recall, F1) were estimated as proportions and synthesized via random-effects meta-analysis. The workflow was rolled out in December 2025, with reminders at 5 and 10 weeks.

Results
Among 51 screened articles (40 original articles, 11 review articles), the LLM identified a repository deposit in 31 articles; authors responded for 17 of these (55%), yielding 39 human-verified datasets. Across the 39 verified datasets, true positives averaged 13.15 (SD 4.57; range 6–27). False positives were rare, with a mean of 0.23 (SD 0.58; range 0–2), and false negatives were also low, with a mean of 1.46 (SD 1.93; range 0–6). Precision was consistently high across datasets, with an overall random-effects estimate of 99.65% (95% CI 98.42% to 100.00%) and no detectable heterogeneity (I² = 0.00%). Recall was more variable, with an overall estimate of 93.75% (95% CI 89.79% to 96.96%) and moderate heterogeneity (I² = 55.08%). Combined performance, expressed as the F1 score, yielded an overall estimate of 96.17% (95% CI 93.78% to 98.11%).
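The per-dataset metrics can be sketched as follows. This is a minimal illustration of how precision, recall, and F1 are derived from verified TP/FP/FN counts, not the study's analysis code; the random-effects synthesis of the per-dataset proportions is omitted, and the example counts are hypothetical.

```python
def metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from human-verified annotation counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical dataset: 13 accepted entries, 0 spurious, 1 missed.
print(metrics(tp=13, fp=0, fn=1))
```

In the reported results, precision is near its ceiling because false positives are rare, while the larger and more variable false-negative counts drive the heterogeneity seen in recall.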
Conclusions
The hybrid workflow achieved very high precision with moderately variable recall, effectively shifting effort from drafting to reviewing while preserving schema compliance. However, the modest author response rate limits sample size and generalizability; broader engagement and multi-site validation are needed to confirm robustness across domains.
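The schema flattening and remapping step described in the Methods can be sketched as below. This is a minimal illustration under stated assumptions: the schema is flattened to dot-separated key paths for prompting, and LLM output keyed by those paths is remapped into the original hierarchy. The field names are hypothetical, not the system's actual metadata schema, and provenance tagging is omitted.

```python
def flatten(schema: dict, prefix: str = "") -> dict:
    """Flatten a nested schema into dot-separated key paths."""
    flat = {}
    for key, value in schema.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def remap(flat: dict) -> dict:
    """Rebuild the original hierarchy from dot-separated key paths."""
    nested: dict = {}
    for path, value in flat.items():
        node = nested
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested

# Hypothetical example of a round trip through the two steps.
schema = {"dataset": {"repository": "GEO", "accession": "GSE000000"}}
assert remap(flatten(schema)) == schema
```

Remapping the flat output back into the hierarchy is what lets the pipeline emit strict, schema-compliant JSON even though the prompt itself only sees a flat list of fields.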