MetaMuse: A Multi-Agent AI System for Biomedical Metadata Curation and Harmonization

Ekansh Mittal
Elon Litman
Tyler Myers
Vinayak Agarwal
Ashwin Gopinath
Timothy Kassis

Abstract

Inconsistent and unstructured metadata in public biomedical repositories, such as the Gene Expression Omnibus (GEO), severely limits data discoverability and research reproducibility. To address this, we introduce MetaMuse, a modular, multi-agent artificial intelligence framework designed to autonomously extract, validate, and standardize unstructured biomedical metadata. MetaMuse operates through a three-stage architecture built on large language model agents. First, specialized Curator Agents extract candidate values in context for specific target metadata fields. Next, a centralized Arbitrator Agent enforces cross-field logical consistency to prevent contradictory annotations. Finally, a Normalizer Agent, which uses a domain-specific semantic search model (SapBERT), maps these free-text candidates to formal ontological terms. We evaluated MetaMuse on a gold-standard dataset of manually curated GEO samples, achieving over 95% curation accuracy across key target metadata fields, and demonstrated robust scalability on a broader dataset of 400 samples. Notably, MetaMuse avoids data hallucination by defaulting to conservative false negatives when evidence is ambiguous, thereby preserving strict data integrity. By providing a fully auditable and context-aware curation pipeline, MetaMuse offers a scalable solution for enriching public data repositories and accelerating reproducible, data-driven scientific discovery.
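As a rough illustration of the three-stage flow described in the abstract, the following Python sketch chains a curator, an arbitrator, and a normalizer. It is not the authors' implementation, and everything in it is an assumption made for illustration: keyword matching stands in for the LLM Curator Agents, a hand-written list of incompatible pairs stands in for the Arbitrator Agent's logic, character-trigram cosine similarity stands in for SapBERT embeddings, and the `EX:` ontology identifiers are placeholders. The one behavior taken from the abstract is that each stage returns `None` (a conservative false negative) rather than guessing when evidence is missing or contradictory.

```python
import math
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    field: str
    value: Optional[str]  # None means the stage abstained
    evidence: str


def curate(field, text, keywords):
    """Stage 1 stand-in: whole-word keyword lookup instead of an LLM agent."""
    lowered = text.lower()
    for kw in keywords:
        if re.search(rf"\b{re.escape(kw)}\b", lowered):
            return Candidate(field, kw, evidence=kw)
    return Candidate(field, None, evidence="")  # abstain, do not guess


def arbitrate(candidates, incompatible):
    """Stage 2 stand-in: blank the second field of any contradictory pair."""
    out = dict(candidates)
    for (fa, va), (fb, vb) in incompatible:
        if fa in out and fb in out and out[fa].value == va and out[fb].value == vb:
            out[fb] = Candidate(fb, None, evidence=f"conflict with {fa}")
    return out


def trigrams(s):
    """Character-trigram counts of a padded, lowercased string."""
    s = f"  {s.lower()} "
    counts = {}
    for i in range(len(s) - 2):
        counts[s[i:i + 3]] = counts.get(s[i:i + 3], 0) + 1
    return counts


def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def normalize(value, ontology, threshold=0.5):
    """Stage 3 stand-in: nearest ontology label by trigram cosine similarity."""
    if value is None:
        return None
    query = trigrams(value)
    best_id, best_score = None, 0.0
    for term_id, label in ontology.items():
        score = cosine(query, trigrams(label))
        if score > best_score:
            best_id, best_score = term_id, score
    return best_id if best_score >= threshold else None


def run_pipeline(text, field_keywords, incompatible, ontologies):
    candidates = {f: curate(f, text, kws) for f, kws in field_keywords.items()}
    candidates = arbitrate(candidates, incompatible)
    return {f: normalize(c.value, ontologies.get(f, {})) for f, c in candidates.items()}


# Hypothetical inputs; the EX: identifiers are placeholders, not real ontology IDs.
FIELD_KEYWORDS = {"tissue": ["liver", "lung"], "sex": ["female", "male"]}
ONTOLOGIES = {
    "tissue": {"EX:0001": "liver", "EX:0002": "lung"},
    "sex": {"EX:1001": "male", "EX:1002": "female"},
}

if __name__ == "__main__":
    sample = "Liver biopsy from adult male donor"
    print(run_pipeline(sample, FIELD_KEYWORDS, [], ONTOLOGIES))
```

In a fuller version, `normalize` would compare dense embeddings of the candidate and of each ontology label instead of trigram counts, but the thresholded nearest-neighbor structure would stay the same.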

Version published to 10.64898/2026.04.12.718044 on bioRxiv
Apr 15, 2026

Related articles

CROssBARv2: A Unified Computational Framework for Heterogeneous Biomedical Data Representation and LLM-Driven Exploration

This article has 9 authors:
1. Bünyamin Şen
2. Erva Ulusoy
3. Melih Darcan
4. Mert Ergün
5. Sebastian Lobentanzer
6. Ahmet S. Rifaioglu
7. Dénes Türei
8. Julio Saez-Rodriguez
9. Tunca Doğan
This article has no evaluations. Latest version: Apr 15, 2026

All Models are Wrong, Some are Annotated: Automating Metadata in Biomedical Repositories

This article has 3 authors:
1. Inessa Cohen
2. Hongyi Yu
3. Robert A. McDougal
This article has no evaluations. Latest version: Apr 27, 2026

Agentic Authoring of OMOP Concept Sets from Natural Language

This article has 6 authors:
1. Hongyu Chen
2. Xing He
3. Hao Dai
4. Yu Huang
5. Mei Liu
6. Jiang Bian
This article has no evaluations. Latest version: Jun 3, 2026
