MetaMuse: A Multi-Agent AI System for Biomedical Metadata Curation and Harmonization
Abstract
Inconsistent and unstructured metadata in public biomedical repositories, such as the Gene Expression Omnibus (GEO), severely limits data discoverability and research reproducibility. To address this, we introduce MetaMuse, a modular, multi-agent artificial intelligence framework designed to autonomously extract, validate, and standardize unstructured biomedical metadata. MetaMuse operates through a three-stage architecture built on large language model agents. First, specialized Curator Agents contextually extract candidate values for specific target metadata fields. Second, a centralized Arbitrator Agent enforces cross-field logical consistency to prevent contradictory annotations. Finally, a Normalizer Agent leveraging a domain-specific semantic search model (SapBERT) maps these free-text candidates to formal ontological terms. We evaluated MetaMuse on a gold-standard dataset of manually curated GEO samples, where it achieved over 95% curation accuracy across key target metadata fields, and demonstrated robust scalability on a broader dataset of 400 samples. Notably, MetaMuse avoids data hallucination by defaulting to conservative false negatives when evidence is ambiguous, thereby preserving strict data integrity. By providing a fully auditable and context-aware curation pipeline, MetaMuse offers a scalable solution for enriching public data repositories and accelerating reproducible, data-driven scientific discovery.
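The three-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the MetaMuse implementation: the regex-based curator stands in for an LLM Curator Agent, the single consistency rule stands in for the Arbitrator Agent, and a `difflib` string-similarity score stands in for SapBERT embedding search. All names (`curate_sample`, `FIELDS`, `RULES`), the `EX:` ontology identifiers, and the 0.5 threshold are assumptions invented for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

Candidates = Dict[str, Optional[str]]

FIELDS = ("tissue", "sex")  # hypothetical target metadata fields

# Invented placeholder ontology; the "EX:" identifiers are not real terms.
ONTOLOGY: Dict[str, Dict[str, str]] = {
    "tissue": {"liver": "EX:0001", "ovary": "EX:0002"},
    "sex": {"male": "EX:1001", "female": "EX:1002"},
}


def curate_field(field: str, text: str) -> Optional[str]:
    """Stage 1 stand-in: pull the value after 'field:' instead of asking an LLM."""
    match = re.search(rf"\b{re.escape(field)}\s*:\s*([^;\n]+)", text, re.IGNORECASE)
    return match.group(1).strip().lower() if match else None


def sex_tissue_rule(candidates: Candidates) -> List[str]:
    """Hypothetical cross-field rule: flag a male sample annotated with ovary."""
    if candidates.get("sex") == "male" and candidates.get("tissue") == "ovary":
        return ["sex", "tissue"]
    return []


RULES: List[Callable[[Candidates], List[str]]] = [sex_tissue_rule]


def arbitrate(candidates: Candidates) -> Candidates:
    """Stage 2 stand-in: blank every field involved in a contradiction."""
    resolved = dict(candidates)
    for rule in RULES:
        for field in rule(candidates):
            resolved[field] = None
    return resolved


def normalize(field: str, value: Optional[str], threshold: float = 0.5) -> Optional[str]:
    """Stage 3 stand-in: nearest ontology label by string similarity.

    Returns None below the threshold, i.e. a conservative false negative
    rather than a forced (possibly wrong) mapping.
    """
    if value is None:
        return None
    best_label, best_score = None, 0.0
    for label in ONTOLOGY[field]:
        score = SequenceMatcher(None, value, label).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None or best_score < threshold:
        return None
    return ONTOLOGY[field][best_label]


def curate_sample(text: str) -> Candidates:
    """Run extraction, arbitration, and normalization on one sample description."""
    candidates = {field: curate_field(field, text) for field in FIELDS}
    resolved = arbitrate(candidates)
    return {field: normalize(field, resolved[field]) for field in FIELDS}
```

Under these assumptions, `curate_sample("tissue: liver tissue; sex: male")` maps both fields to placeholder terms, while a contradictory input such as `"tissue: ovary; sex: male"` or an uninformative one such as `"tissue: unknown; sex: male"` leaves the affected fields empty instead of guessing. Swapping the similarity function for embedding-based search would change only `normalize`.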