Metadata Harmonization from Biological Datasets with Language Models
Abstract
Biomedical research faces significant challenges in harmonizing metadata across diverse datasets because labeling is inconsistent and no ontology is universally adopted. Conventional solutions, such as Common Data Elements, are difficult to adopt because they require researchers to navigate thousands of standardized terms with subtle variations, which slows scientific work. Tools such as laboratory information management systems, while designed to enforce standardization, can hinder research progress when their rigid standards conflict with domain-specific documentation needs and evolving research practices. As a result, researchers maintain their own annotation systems, leading to disconnected datasets that are difficult to integrate across studies.
This study presents a novel approach that uses large language models to automatically map researcher annotations to standard terms within ontologies. The approach is applied to multiple domains, including oncology, alcohol research, and infectious disease. Data augmentation strategies are presented to align training data with the space of human representations. These strategies generate realistic variations of standard terms to simulate how researchers naturally document their work, which is especially valuable in domains lacking the extensive terminology mappings needed to train language models. Experiments with fine-tuned GPT-2 variants show up to 96% accuracy on in-dictionary tasks and 17% on out-of-dictionary tasks, outperforming traditional techniques and zero-shot GPT-4o. This suggests that metadata standardization labor could be reduced by up to 96% when a term exists in an ontology. We also show a significant trade-off between domain-specific models and models that aim to generalize across domains such as infectious disease and alcohol research. While larger models excel at generalization, fine-tuned models consistently perform better on domain-specific terminology. This approach enables more efficient and accurate research data integration across biomedical fields, though out-of-dictionary generalization remains a challenge across all model sizes.
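As an illustration of the data augmentation idea described above, the following minimal Python sketch generates surface variants of a standard term and pairs each variant with its standard form as a training example. The abstract does not specify the augmentation rules, so everything here is an assumption made for illustration: the function names (`augment_term`, `build_training_pairs`), the specific transformations (case changes, snake_case, acronyms, vowel dropping, word reordering, adjacent-character typos), and the example terms are hypothetical and are not the authors' method.

```python
import random


def drop_vowels(word: str) -> str:
    """Abbreviate a word by keeping its first letter and dropping later vowels."""
    return word[0] + "".join(c for c in word[1:] if c.lower() not in "aeiou")


def swap_adjacent(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_term(term: str, n_typos: int = 3, seed: int = 0) -> list[str]:
    """Return distinct surface variants of a standard term.

    The rules below are hypothetical stand-ins for how researchers might
    write a term; they are not taken from the paper.
    """
    rng = random.Random(seed)
    words = term.split()
    candidates = [
        term.lower(),
        term.upper(),
        "_".join(words).lower(),                  # snake_case column header
        "".join(w[0] for w in words).upper(),     # acronym
        " ".join(drop_vowels(w) for w in words),  # vowel-dropped abbreviation
        " ".join(reversed(words)),                # reordered words
    ]
    candidates += [swap_adjacent(term, rng) for _ in range(n_typos)]

    # Deduplicate while preserving order, and exclude the standard term itself.
    seen = set()
    variants = []
    for candidate in candidates:
        if candidate != term and candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


def build_training_pairs(terms: list[str]) -> list[tuple[str, str]]:
    """Pair every variant (model input) with its standard term (model target)."""
    return [(variant, term) for term in terms for variant in augment_term(term)]


# Example terms are illustrative only, not drawn from a specific ontology.
pairs = build_training_pairs(["Tumor Grade", "Alcohol Use Disorder"])
for variant, standard in pairs[:6]:
    print(f"{variant!r} -> {standard!r}")
```

Pairs of this form could serve as input and target sequences when fine-tuning a language model. A realistic pipeline would likely need richer, domain-informed variation than these simple string rules provide.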