Automated Harmonization and Large-Scale Integration of Heterogeneous Biomedical Sample Metadata Using Large Language Models
Abstract
The exponential growth of biomedical data has created an urgent need for efficient integration and analysis of heterogeneous sample metadata across studies. However, current methods for harmonizing and standardizing these metadata are largely manual, time-consuming, and prone to inconsistencies. Here, we present a novel computational framework that leverages large language models (LLMs) to automate the harmonization and large-scale integration of diverse biomedical sample metadata. Our approach combines semantic clustering techniques with LLM-driven natural language processing to extract, interpret, and standardize metadata from various sources, including research papers, supplementary tables, and text data from public databases. We demonstrate the efficacy of our framework by applying it to thousands of human gut microbiome papers, successfully extracting and integrating metadata from over 400,000 samples. Our method recovered 50% of manually curated metadata, significantly outperforming traditional rule-based methods. Furthermore, our framework enabled the creation of a unified, searchable database of standardized metadata, facilitating cross-study analyses and revealing previously obscured patterns in microbiome composition across diverse populations and conditions. The scalability and adaptability of our approach suggest that it could be applied to a wide range of biomedical fields, accelerating meta-analyses and fostering new insights from existing data. This work represents a significant advancement in biomedical data integration, offering researchers a powerful tool to unlock the full potential of accumulated scientific knowledge.
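As a rough illustration of the two ideas the abstract names, grouping heterogeneous metadata field names and scoring extraction against manual curation, the following is a minimal sketch and not the authors' implementation. It assumes plain string similarity (Python's `difflib`) as a stand-in for the semantic clustering and LLM steps, and it assumes that "recovery rate" means the fraction of manually curated values reproduced exactly; the abstract does not define the metric. All function names, field names, and the 0.7 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase a field name and collapse underscores, hyphens, and extra spaces."""
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for semantic (embedding) similarity."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_fields(names, threshold=0.7):
    """Greedy one-pass clustering: a name joins the first cluster whose
    first member is at least `threshold` similar, else starts a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def recovery_rate(extracted: dict, curated: dict) -> float:
    """Fraction of curated (sample, field) values that extraction matched exactly.
    This definition is an assumption made for illustration."""
    if not curated:
        return 0.0
    hits = sum(1 for key, value in curated.items() if extracted.get(key) == value)
    return hits / len(curated)


# Hypothetical field names as they might appear across different studies.
fields = ["host_age", "Host Age", "host-age", "body_site", "Body site"]
clusters = cluster_fields(fields)

# Hypothetical curated versus automatically extracted values.
curated = {("s1", "age"): "34", ("s1", "sex"): "female",
           ("s2", "age"): "51", ("s2", "sex"): "male"}
extracted = {("s1", "age"): "34", ("s2", "sex"): "male"}
rate = recovery_rate(extracted, curated)
```

In a real pipeline the similarity function would presumably be replaced by embedding distances or LLM judgments, since string matching cannot relate synonyms such as "gender" and "sex".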