Metadata Harmonization from Biological Datasets with Language Models

Abstract

Biomedical research faces significant challenges in harmonizing metadata across diverse datasets due to inconsistent labeling and the lack of universally adopted ontologies. Conventional solutions, such as Common Data Elements, face adoption difficulties because they require researchers to navigate thousands of standardized terms with subtle variations, slowing scientific progress. Tools such as laboratory information management systems, while designed to enforce standardization, can likewise hinder research when their rigid standards conflict with domain-specific documentation needs and evolving research practices. As a result, researchers maintain their own annotation systems, producing disconnected datasets that are difficult to integrate across studies.

This study presents a novel approach that uses large language models to automatically map researcher annotations to standardized ontology terms, applied across multiple domains including oncology, alcohol research, and infectious disease. Data augmentation strategies are presented that align training data with the variability of real-world annotation practices: they generate realistic variations of standard terms to simulate how researchers naturally document their work, which is especially valuable in domains lacking the extensive terminology mappings needed to train language models. Experiments with fine-tuned GPT-2 variants show up to 96% accuracy on in-dictionary tasks and 17% on out-of-dictionary tasks, outperforming traditional techniques and zero-shot GPT-4o. This implies up to a 96% reduction in metadata standardization labor when a term already exists in the target ontology. We also show a significant trade-off between domain-specific models and those that aim to generalize across domains such as infectious disease or alcohol research: while larger models excel at generalization, fine-tuned models consistently outperform them on domain-specific terminology. This approach enables more efficient and accurate research data integration across biomedical fields, though out-of-dictionary generalization remains a challenge across all model sizes.
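
To make the augmentation idea concrete, the sketch below shows one way such training pairs could be generated: taking canonical ontology terms and producing researcher-style surface variations (casing drift, ad-hoc acronyms, reordering, simulated typos) paired with their standard form. This is a minimal illustration under assumed conventions; the function names (make_variants, build_training_pairs) and example terms are hypothetical and do not reflect the paper's actual pipeline or ontologies.

```python
import random

# Hypothetical canonical terms drawn from an ontology (illustrative only).
CANONICAL_TERMS = [
    "hepatocellular carcinoma",
    "alcohol use disorder",
    "severe acute respiratory syndrome coronavirus 2",
]

def make_variants(term: str, rng: random.Random) -> list[str]:
    """Generate researcher-style surface variations of a standard term."""
    words = term.split()
    variants = {
        term.upper(),                          # casing drift: "HEPATOCELLULAR CARCINOMA"
        term.title(),                          # "Hepatocellular Carcinoma"
        "".join(w[0] for w in words).upper(),  # ad-hoc acronym: "HCC"
        " ".join(reversed(words)),             # reordered phrasing
        term.replace(" ", "_"),                # column-header style: "alcohol_use_disorder"
    }
    # Simulated typo: drop one random interior character.
    i = rng.randrange(1, len(term) - 1)
    variants.add(term[:i] + term[i + 1:])
    return sorted(variants)

def build_training_pairs(terms: list[str], seed: int = 0) -> list[tuple[str, str]]:
    """(noisy annotation, canonical term) pairs for fine-tuning an LM normalizer."""
    rng = random.Random(seed)
    return [(v, t) for t in terms for v in make_variants(t, rng)]

if __name__ == "__main__":
    for noisy, canonical in build_training_pairs(CANONICAL_TERMS)[:6]:
        print(f"{noisy!r} -> {canonical!r}")
```

Each (noisy, canonical) pair could then be serialized as a prompt/completion example for fine-tuning a GPT-2-style model, with a held-out subset of canonical terms serving as the out-of-dictionary split the abstract describes.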
