From Spreadsheets and Bespoke Models to Enterprise Data Warehouses: GPT-enabled Clinical Data Ingestion into i2b2

Taowei David Wang
Shawn N. Murphy
Victor M. Castro
Jeffrey G. Klann

Abstract

Objective

Clinical and phenotypic data available to researchers are often found in spreadsheets or bespoke data models. Bridging these to enterprise data warehouses would enable sophisticated analytics and cohort discovery for users of platforms like NHGRI’s Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL). We combine data mapping methodologies, biomedical ontologies, and large language models (LLMs) to load these data into Informatics for Integrating Biology and the Bedside (i2b2), making them available to AnVIL users.

Materials and Methods

We developed few-shot prompts for ChatGPT-4o to generate Python scripts that facilitate the extract, transform, and load (ETL) process into i2b2. The scripts first convert a designated data dictionary (in various formats) into an intermediate common format, and then into an i2b2 ontology. Finally, the original data file is converted into i2b2 facts, using standard ontologies hosted by the National Center for Biomedical Ontology (NCBO).
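The three-stage flow described above (data dictionary to intermediate format, intermediate format to i2b2 ontology, data file to i2b2 facts) can be illustrated with a minimal Python sketch. This is not the authors' generated code. The input column names (`variable`, `description`, `type`, `subject_id`), the `\Demo\` root path, and the `DEMO:` code prefix are all hypothetical, and the local codes stand in for the NCBO-hosted standard ontology codes the pipeline uses. Only a small subset of the i2b2 ontology and `observation_fact` columns is populated.

```python
import csv
import io


def dictionary_to_intermediate(csv_text):
    """Stage 1: parse a CSV data dictionary into a simple common format.

    Assumes (hypothetically) the columns 'variable', 'description', 'type'.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "name": r["variable"].strip(),
            "label": r["description"].strip(),
            "type": r["type"].strip().lower(),
        }
        for r in rows
    ]


def intermediate_to_ontology(variables, root="\\Demo\\", prefix="DEMO"):
    """Stage 2: emit one i2b2 ontology row (subset of columns) per variable."""
    ontology = []
    for v in variables:
        path = f"{root}{v['name']}\\"
        ontology.append(
            {
                # The root path has two backslashes and sits at level 0.
                "C_HLEVEL": path.count("\\") - 2,
                "C_FULLNAME": path,
                "C_NAME": v["label"],
                "C_BASECODE": f"{prefix}:{v['name']}",  # placeholder local code
                "C_VISUALATTRIBUTES": "LA",  # active leaf
            }
        )
    return ontology


def rows_to_facts(csv_text, variables, id_column="subject_id", prefix="DEMO"):
    """Stage 3: turn each non-empty cell into one observation_fact-style row."""
    types = {v["name"]: v["type"] for v in variables}
    facts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if name not in types or value is None or value.strip() == "":
                continue  # skip the ID column, unmapped columns, missing values
            fact = {"PATIENT_NUM": row[id_column], "CONCEPT_CD": f"{prefix}:{name}"}
            if types[name] == "numeric":
                # Numeric facts: operator in TVAL_CHAR, value in NVAL_NUM.
                fact.update(VALTYPE_CD="N", TVAL_CHAR="E", NVAL_NUM=float(value))
            else:
                fact.update(VALTYPE_CD="T", TVAL_CHAR=value.strip(), NVAL_NUM=None)
            facts.append(fact)
    return facts


if __name__ == "__main__":
    dictionary = "variable,description,type\nage,Age at enrollment,numeric\nsex,Sex at birth,text\n"
    data = "subject_id,age,sex\nP1,42,F\nP2,,M\n"
    variables = dictionary_to_intermediate(dictionary)
    for onto_row in intermediate_to_ontology(variables):
        print(onto_row)
    for fact in rows_to_facts(data, variables):
        print(fact)
```

Keeping the intermediate format as the only interface between stages means that a new data dictionary layout would need only a new stage 1 parser, which is where prompt reuse would be expected to pay off.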

Results

ChatGPT-4o correctly produced Python code to facilitate ETL. We converted phenotype data from three synthetic datasets, each drawn from a different data model available in AnVIL. The scripts generated by our prompts successfully converted data on 3,458 synthetic patients, making them queryable in i2b2.

Discussion

When only a few datasets are converted, iterative prompt refinement might reduce the efficiency gains of the ETL pipeline. However, prompt reuse significantly reduces the incremental effort for each additional data model. At scale, we anticipate that our pipeline offers substantial time savings, which could transform future ETL workflows.

Conclusion

We developed an LLM-powered ETL pipeline to convert disparate datasets into i2b2 format, enabling advanced analytics and cohort discovery across heterogeneous data models.

Version published to 10.1101/2025.04.17.25325962v1 on medRxiv
Apr 19, 2025

Related articles

sampleclusteR: A lightweight R package for automated clustering of transcriptomics samples using metadata

A time-sequenced approach to machine learning prognostic modelling with implementation on running-related injury prediction

Barriers and facilitators for digital health medical device registration in the UK: A scoping review