Evaluation of T2DM Phenotyping Using Optimized Retrieval-Augmented Generation (RAG) and the Impact of Embedding Model, Context, and Prompt
Abstract
Objective
Identification of patient cohorts from EHRs is challenging because ICD codes primarily serve billing and may misrepresent disease status, while key information is buried in unstructured notes. Existing computational phenotyping methods also have limitations in maintenance and incomplete modeling. We evaluated GPT-4o's type 2 diabetes mellitus (T2DM) phenotyping ability using optimized Retrieval-Augmented Generation (RAG).
Methods
We built a RAG pipeline and loaded clinical notes for 275 patients screened by T2DM ICD codes. Using training patients, we optimized chunk size and top-k across seven embedding models, testing 308 RAG configurations. Prompts (zero-shot and few-shot) were developed via error analysis. GPT-4o's phenotyping performance was evaluated against ICD codes and PheNorm within the optimized RAG framework. Token usage and sensitivity to key hyperparameters were also assessed.
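The retrieval step described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the chunking parameters, the bag-of-words stand-in for an embedding model, and all function names are assumptions, and a real pipeline would call one of the seven evaluated embedding models instead.

```python
from collections import Counter
import math


def chunk_notes(text, chunk_size=128, overlap=32):
    """Split a clinical note into overlapping fixed-size token windows.

    chunk_size and overlap are illustrative; the study tuned chunk size
    per embedding model across 308 configurations.
    """
    tokens = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


def embed(text):
    """Toy bag-of-words 'embedding'; stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=3):
    """Rank note chunks by similarity to the query and keep the top-k,
    which are then passed to the LLM as context for phenotyping."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

A grid sweep over `chunk_size` and `top_k`, scored on training patients, would then select the configuration used at evaluation time.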
Results
GPT-4o with optimized RAG significantly outperformed ICD codes in precision (PPV: 0.940), and PheNorm in sensitivity (0.902), NPV (0.697), and F1 (0.920); its PPV was slightly lower than PheNorm's, and its specificity (0.791) needs improvement. General-purpose embedding models and the zero-shot prompt yielded better sensitivity, NPV, and F1-scores, while domain-specific models and the few-shot prompt excelled in specificity and PPV. Optimization enabled lower-ranked embedding models to achieve performance comparable to the highest-ranked ones. Gte-Qwen2-1.5B-instruct and GatorTronS provided the highest token efficiency on specific metrics. Error analysis revealed contextual misinterpretation and retrieval-ranking issues.
Conclusion
GPT-4o using optimized RAG showed superior T2DM phenotyping performance on key metrics. This study provides practical guidance for using RAG, while identifying limitations in LLM reasoning errors and retrieval ranking.