Identification of biomedical entities from multiple repositories using a specialized metadata schema and search-augmented large language models
Abstract
Objective
Many biomedical articles reference multiple datasets across different public repositories, complicating accurate metadata capture and downstream re‐use. Building on our prior grounded large language model (LLM) workflows for biomedical entity annotation, we extend the approach to identify and annotate all datasets referenced by a paper, even when distributed across repositories, by combining a specialized metadata schema with a three‐step, search‐augmented prompting strategy.
Results
In the Transregional Collaborative Research Center PILOT (TRR 359 “Perinatal Development of Immune Cell Topology”), Gene Expression Omnibus (GEO) releases are common alongside additional repository deposits. The approach reliably detected the datasets referenced in articles and produced schema‐compliant annotations from information available on the repository landing pages. In a validation based on structured face-to-face interviews with the article’s senior author, Gemini 2.5 Pro achieved higher precision (97.1%) than GPT‐4.1 (81.9%, p<0.001) and Claude Sonnet 4 (88.6%, p<0.001). Limiting the annotation to information available in the repositories achieved higher precision than additionally using information from the article (91.9% vs. 88.3% across all LLMs, p=0.004). These results indicate that simple repository‐grounded extraction enables high‐quality, multi‐dataset metadata annotation, which has the potential to minimize the time and effort required for manual metadata annotation.
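To make the reported metric concrete, the following is a minimal sketch of how per-model precision could be computed from author-validated annotations, assuming precision means the share of produced annotations that the senior author judged correct. The function name, the model labels, and the toy judgments are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def precision_by_model(judgments):
    """Return {model: precision} from (model, is_correct) pairs.

    Precision is taken here as correct annotations / all annotations
    produced by that model (an assumption about the study's metric).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, is_correct in judgments:
        total[model] += 1
        if is_correct:
            correct[model] += 1
    return {model: correct[model] / total[model] for model in total}


# Toy, made-up validation judgments (NOT data from the study).
toy_judgments = [
    ("model_a", True), ("model_a", True), ("model_a", False), ("model_a", True),
    ("model_b", True), ("model_b", False),
]

result = precision_by_model(toy_judgments)
for model, precision in sorted(result.items()):
    print(f"{model}: {precision:.1%}")
```

The same tally can be grouped by annotation source (repository only versus repository plus article) instead of by model to reproduce the second comparison.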
