Automated Harmonization and Large-Scale Integration of Heterogeneous Biomedical Sample Metadata Using Large Language Models

Abstract

The exponential growth of biomedical data has created an urgent need for efficient integration and analysis of heterogeneous sample metadata across studies. However, current methods for harmonizing and standardizing these metadata are largely manual, time-consuming, and prone to inconsistencies. Here, we present a novel computational framework that leverages large language models (LLMs) to automate the harmonization and large-scale integration of diverse biomedical sample metadata. Our approach combines semantic clustering techniques with LLM-driven natural language processing to extract, interpret, and standardize metadata from various sources, including research papers, supplementary tables, and text data from public databases. We demonstrate the efficacy of our framework by applying it to thousands of human gut microbiome papers, successfully extracting and integrating metadata from over 400,000 samples. Our method achieved a 50% recovery rate of manually curated metadata, significantly outperforming traditional rule-based methods. Furthermore, our framework enabled the creation of a unified, searchable database of standardized metadata, facilitating cross-study analyses and revealing previously obscured patterns in microbiome composition across diverse populations and conditions. The scalability and adaptability of our approach suggest its potential applicability to a wide range of biomedical fields, potentially accelerating meta-analyses and fostering new insights from existing data. This work represents a significant advancement in biomedical data integration, offering a powerful tool for researchers to unlock the full potential of accumulated scientific knowledge.
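The abstract describes combining semantic clustering with LLM-driven interpretation to unify metadata field names that differ across studies. As a minimal, self-contained sketch of the clustering step, the snippet below groups hypothetical raw field names by string similarity; in the actual framework, LLM-derived semantic embeddings would replace the simple character-level similarity used here, and the field names are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical raw metadata field names as they might appear across studies.
fields = ["host_age", "Age (years)", "age", "body_site", "Body Site", "sample_bodysite"]

def normalize(name):
    # Lowercase and keep only alphanumerics so "Body Site" matches "body_site".
    return "".join(ch for ch in name.lower() if ch.isalnum())

def similarity(a, b):
    # Character-level similarity as a stand-in for LLM semantic similarity.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def cluster_fields(names, threshold=0.6):
    # Greedy clustering: a name joins the first cluster whose representative
    # (its first member) is similar enough; otherwise it starts a new cluster.
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

clusters = cluster_fields(fields)
```

Each resulting cluster would then be mapped to a single standardized field (e.g. all body-site variants to one canonical key), which is the harmonization step the paper automates at scale with LLMs.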
