All Models are Wrong, Some are Annotated: Automating Metadata in Biomedical Repositories

Inessa Cohen
Hongyi Yu
Robert A. McDougal

Abstract

Objective

High-quality metadata is essential for scientific discovery, yet sparse annotations in rapidly growing repositories leave many biologically relevant details uncaptured. We evaluated whether large language models (LLMs) can accurately infer ion channel and receptor subtype metadata from source code in a neuroscience repository.

Materials and Methods

We extracted 5,133 model files from ModelDB. A subset of 1,100 files was manually annotated; 253 of these were held out for testing, and the remainder was split into training (80%) and validation (20%) sets. LLM-based approaches (GPT-5.2 and GPT-mini) were evaluated under zero-shot and heuristic-augmented prompting. Performance was assessed at the type and subtype levels using accuracy, precision, recall, and F1 score. A feature-engineered XGBoost model using text- and simulation-derived features served as a baseline.
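The split arithmetic described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say whether the split was random or stratified, so a seeded random shuffle is assumed, and the function name `split_annotated` and its parameters are hypothetical. The exact training and validation counts depend on how the 80% boundary is rounded.

```python
import random

def split_annotated(items, n_test=253, train_frac=0.8, seed=0):
    """Hold out n_test items, then split the rest into train/validation.

    Assumption: a plain seeded shuffle; the source does not state
    whether the real split was random or stratified by label.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_train = round(train_frac * len(rest))  # 80% of the non-test files
    return rest[:n_train], rest[n_train:], test

# Stand-in IDs for the 1,100 manually annotated files.
train, val, test = split_annotated(range(1100))
print(len(train), len(val), len(test))  # prints: 678 169 253
```

With 253 files held out, 847 remain, so an 80/20 split gives roughly 678 training and 169 validation files under this rounding.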

Results

LLMs outperformed the XGBoost baseline. At the type level, GPT-mini with heuristic augmentation achieved the highest performance (accuracy 96.0%, F1 0.962). At the subtype level, GPT-5.2+heuristics and GPT-mini+heuristics achieved the same accuracy (88.1%), with GPT-5.2+heuristics achieving the highest F1 (0.878). Model outputs were consistent across runs, and errors were confined to related mechanistic families.
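To make the distinction between type-level and subtype-level scoring concrete, the sketch below scores the same predictions at both levels. It rests on several assumptions not stated in the abstract: labels are represented as hypothetical `(type, subtype)` pairs, the example label names are illustrative rather than the paper's label set, and F1 is macro-averaged (the abstract does not say which averaging was used).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    precs, recs, f1s = [], [], []
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Hypothetical (type, subtype) labels, for illustration only.
true = [("K", "A-type"), ("K", "delayed rectifier"),
        ("Na", "transient"), ("Ca", "L-type")]
pred = [("K", "A-type"), ("K", "A-type"),
        ("Na", "transient"), ("Ca", "L-type")]

# Type-level scoring keeps only the first element of each pair.
type_acc = accuracy([t for t, _ in true], [t for t, _ in pred])
subtype_acc = accuracy(true, pred)
subtype_p, subtype_r, subtype_f1 = macro_prf(true, pred)
print(type_acc, subtype_acc)  # prints: 1.0 0.75
```

In this toy case the single error stays within one channel type, so it costs nothing at the type level but lowers subtype accuracy and macro F1, which is the pattern one would expect when errors are confined to related families.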

Discussion and Conclusion

LLMs demonstrate strong potential for metadata annotation directly from source code, outperforming feature-engineering approaches with minimal tuning. However, performance varied across subtypes, and errors often reflected ambiguity or bias toward more common labels. These findings suggest that LLMs may serve as practical tools for scalable metadata generation in biomedical repositories, although careful evaluation and domain-specific validation remain important. Although demonstrated in computational neuroscience, the approach may generalize to metadata annotation in other scientific code repositories.

Version published to 10.64898/2026.04.23.720371 on bioRxiv
Apr 27, 2026

Related articles

BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models

This article has 7 authors:
1. Breno L. S. de Almeida
2. Robson P. Bonidia
3. Martin Bole
4. Anderson Avila-Santos
5. Peter F. Stadler
6. Ulisses N. da Rocha
7. André C. P. L. F. de Carvalho
This article has no evaluations
Latest version Apr 22, 2026

Extraction of Human Phenotype Ontology (HPO) Concepts from Clinical Notes Utilizing Large Language Models (LLM) with Model Context Protocol (MCP)

This article has 5 authors:
1. Michael Larsen
2. Ian M. Campbell
3. Lori A. Orlando
4. Peter Robinson
5. Nephi A. Walton
This article has no evaluations
Latest version May 25, 2026

Improving package annotation in metabolomics and proteomics via robust, ontology-driven LLM integration

This article has 8 authors:
1. Sebastian Lobentanzer
2. Helge Hecht
3. Vincent J Carey
4. Maria A Doyle
5. Alban Gaignard
6. Hervé MENAGER
7. Júlia Mir
8. Claire Rioualen
This article has no evaluations
Latest version Apr 14, 2026