MetaMuse: A Multi-Agent AI System for Biomedical Metadata Curation and Harmonization

Abstract

Inconsistent and unstructured metadata in public biomedical repositories, such as the Gene Expression Omnibus (GEO), severely limits data discoverability and research reproducibility. To address this, we introduce MetaMuse, a modular, multi-agent artificial intelligence framework designed to autonomously extract, validate, and standardize unstructured biomedical metadata. MetaMuse operates through a three-stage architecture built on large language model agents. First, specialized Curator Agents contextually extract candidate values for specific target metadata fields. Next, a centralized Arbitrator Agent enforces cross-field logical consistency to prevent contradictory annotations. Finally, a Normalizer Agent leveraging a domain-specific semantic search model (SapBERT) maps these free-text candidates to formal ontological terms. We evaluated MetaMuse on a gold-standard dataset of manually curated GEO samples, achieving over 95% curation accuracy across key target metadata fields, and demonstrated robust scalability on a broader dataset of 400 samples. Notably, MetaMuse avoids data hallucination by defaulting to conservative false negatives when evidence is ambiguous, thereby preserving strict data integrity. By providing a fully auditable and context-aware curation pipeline, MetaMuse offers a scalable solution for enriching public data repositories and accelerating reproducible, data-driven scientific discovery.
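
The three-stage flow described in the abstract (extract, arbitrate, normalize) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy ontology and its `TOY:` identifiers, the single cross-field rule, and the 0.8 threshold are all hypothetical. Simple key-value parsing stands in for the LLM-based Curator Agents, and string similarity from the standard library stands in for SapBERT embedding search.

```python
from difflib import SequenceMatcher

# Hypothetical toy ontology; IDs and labels are illustrative placeholders only.
ONTOLOGY = {
    "tissue": {"TOY:0001": "liver", "TOY:0002": "prostate"},
    "sex": {"TOY:0101": "female", "TOY:0102": "male"},
}
# Hypothetical key aliases a curator might recognise for each target field.
ALIASES = {"tissue": ("tissue", "organ"), "sex": ("sex", "gender")}


def curate(field, text):
    """Stage 1 stand-in: pull a free-text candidate from 'key: value' pairs."""
    for part in text.split(";"):
        key, sep, value = part.partition(":")
        if sep and key.strip().lower() in ALIASES[field]:
            return value.strip().lower() or None
    return None  # no evidence found, so no value is proposed


def arbitrate(candidates):
    """Stage 2 stand-in: blank out fields that violate a cross-field rule."""
    checked = dict(candidates)
    tissue, sex = checked.get("tissue"), checked.get("sex")
    # Assumed example rule: prostate tissue contradicts a female annotation.
    if tissue and sex and "prostate" in tissue and sex == "female":
        checked["tissue"] = checked["sex"] = None
    return checked


def normalize(field, value, threshold=0.8):
    """Stage 3 stand-in: map free text to the closest ontology term ID."""
    if value is None:
        return None
    best_id, best_score = None, 0.0
    for term_id, label in ONTOLOGY[field].items():
        score = SequenceMatcher(None, value, label).ratio()
        if score > best_score:
            best_id, best_score = term_id, score
    # Below the threshold, prefer a missing value over a doubtful mapping.
    return best_id if best_score >= threshold else None


def run_pipeline(text):
    """Run all three stages over one sample's raw metadata string."""
    candidates = {field: curate(field, text) for field in ONTOLOGY}
    checked = arbitrate(candidates)
    return {field: normalize(field, value) for field, value in checked.items()}
```

Calling `run_pipeline("tissue: Liver; gender: female")` maps both fields to toy term IDs, while a contradictory or uninformative input leaves the affected fields empty. That mirrors, in a very reduced form, the conservative false-negative behavior the abstract describes.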
