Identification of biomedical entities from multiple repositories using a specialized metadata schema and search-augmented large language models

Klaus Kaier
Felix Engel
Gita Benadi
Claudia Giuliani
Manuel Watter
Aref Kalantari
Karin Schuller
Claus-Werner Franzke
Markus Sperandio
Harald Binder

Abstract

Objective

Many biomedical articles reference multiple datasets across different public repositories, complicating accurate metadata capture and downstream re‐use. Building on our prior grounded large language model (LLM) workflows for biomedical entity annotation, we extend the approach to identify and annotate all datasets referenced by a paper, even when distributed across repositories, by combining a specialized metadata schema with a three‐step, search‐augmented prompting strategy.
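To make the multi-dataset problem concrete, the sketch below shows one way a per-dataset record could be represented and how GEO series accessions (which have the form "GSE" followed by digits) could be collected from article text. This is a plain regular-expression baseline swapped in for illustration, not the search-augmented LLM prompting strategy described above; the `DatasetRecord` class, its field names, and the sample text are hypothetical and are not taken from the authors' schema.

```python
import re
from dataclasses import dataclass, field

# GEO series accessions consist of the prefix "GSE" followed by digits.
GEO_SERIES = re.compile(r"\bGSE\d+\b")


@dataclass
class DatasetRecord:
    """One dataset referenced by a paper (hypothetical field names)."""
    repository: str
    accession: str
    annotations: dict = field(default_factory=dict)


def find_geo_series(text: str) -> list[DatasetRecord]:
    """Return one record per distinct GEO series accession, in order of first mention."""
    seen: list[str] = []
    for accession in GEO_SERIES.findall(text):
        if accession not in seen:
            seen.append(accession)
    return [DatasetRecord("GEO", accession) for accession in seen]


# Hypothetical data-availability statement, used only for illustration.
sample = "Data are available at GEO (GSE12345, GSE67890); see also GSE12345."
records = find_geo_series(sample)
print([r.accession for r in records])
```

A pattern-based pass like this only covers repositories with a fixed accession format, which is one reason a paper that spreads its data across several repositories is harder to annotate completely.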

Results

In the Transregional Collaborative Research Center PILOT (TRR 359 “Perinatal Development of Immune Cell Topology”), Gene Expression Omnibus (GEO) releases are common alongside additional repository deposits. The applied approach reliably detected the datasets referenced in articles and produced schema‐compliant annotations using information available on the repository landing pages. In a validation based on structured face-to-face interviews with the article’s senior author, Gemini 2.5 Pro achieved higher precision (97.1%) than GPT‐4.1 (81.9%, p<0.001) and Claude Sonnet 4 (88.6%, p<0.001). Limiting the annotation to the information available in the repositories achieved higher precision than adding information from the article (91.9% vs. 88.3% across all LLMs, p=0.004). These results indicate that simple repository‐grounded extraction enables high-quality, multi‐dataset metadata annotation, which has the potential to minimize the time and effort required for manual metadata annotation.
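The precision comparison reported above can be illustrated with a short calculation. The sketch below computes precision as the share of annotations judged correct and compares two precisions with a pooled two-proportion z-test. The abstract does not state which statistical test the authors used, so the choice of test is an assumption, and the counts are hypothetical placeholders rather than the study's data.

```python
import math


def precision(correct: int, total: int) -> float:
    """Share of produced annotations that were judged correct."""
    return correct / total


def two_proportion_z_test(c1: int, n1: int, c2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (NOT from the study): correct annotations out of all annotations.
correct_a, total_a = 90, 100
correct_b, total_b = 80, 100

z, p = two_proportion_z_test(correct_a, total_a, correct_b, total_b)
print(f"precision A = {precision(correct_a, total_a):.3f}")
print(f"precision B = {precision(correct_b, total_b):.3f}")
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

This test treats the two sets of annotations as independent samples. If the same annotation fields are scored for every model, a paired method such as McNemar's test may be the more appropriate choice.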

Version published to 10.1101/2025.10.21.25338460 on medRxiv, Oct 23, 2025

Related articles

A metadata schema for documenting material samples from multiple domains

This article has 14 authors:
1. Stephen Richard
2. Dave Vieglais
3. Andrea Thomer
4. Sarah Hyunju Song
5. Neil Davies
6. Quan Gan
7. Eric Kansa
8. Sarah Kansa
9. John Kunze
10. Kerstin Lehnert
11. Danny Mandel
12. Chris Meyer
13. Rebecca Snyder
14. Ramona Walls
This article has no evaluations. Latest version: Feb 4, 2026

Intelligent Semantic Search Engine for Biomedical Literature and Clinical Trials: A Comprehensive Hybrid Retrieval Framework

This article has 1 author:
1. Sasidhara Kashyap Chaturvedula
This article has no evaluations. Latest version: Jan 29, 2026

Ontology-Driven Semantic Alignment: Assessing the Reasoning Capabilities of Large Language Models in Geospatial Contexts

This article has 2 authors:
1. Fabíola Andrade Souza
2. Silvana Philippi Camboim
This article has no evaluations. Latest version: Dec 22, 2025