Evaluation of T2DM Phenotyping Using Optimized Retrieval-Augmented Generation (RAG) and the Impact of Embedding Model, Context, and Prompt
Abstract
Objective
Identification of patient cohorts from EHRs is challenging because ICD codes primarily serve billing and may misrepresent disease status, while key information is buried in unstructured notes. Existing computational phenotyping methods also have limitations in maintenance and incomplete modeling. We evaluated GPT-4o's type 2 diabetes mellitus (T2DM) phenotyping ability using optimized Retrieval-Augmented Generation (RAG).
Methods
We built a RAG pipeline and loaded clinical notes for 275 patients screened by T2DM ICD codes. Using training patients, we optimized chunk size and top-k across seven embedding models, testing 308 RAG configurations. Prompts (zero-shot and few-shot) were developed via error analysis. GPT-4o's phenotyping performance was evaluated against ICD codes and PheNorm within the optimized RAG framework. Token usage and sensitivity to key hyperparameters were also assessed.
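The retrieval step described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: the chunking parameters, the bag-of-words stand-in for an embedding model, and all function names are assumptions, and a real pipeline would call one of the seven evaluated embedding models instead.

```python
from collections import Counter
import math


def chunk_notes(text, chunk_size=128, overlap=32):
    """Split a clinical note into overlapping fixed-size token windows.

    chunk_size and overlap are illustrative; the study tuned chunk size
    per embedding model across 308 configurations.
    """
    tokens = text.split()
    step = max(chunk_size - overlap, 1)
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


def embed(text):
    """Toy bag-of-words 'embedding'; stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=3):
    """Rank note chunks by similarity to the query and keep the top-k,
    which are then passed to the LLM as context for phenotyping."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

A grid sweep over `chunk_size` and `top_k`, scored on training patients, would then select the configuration used at evaluation time.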
Results
GPT-4o with optimized RAG significantly outperformed ICD codes in precision (PPV: 0.940), and PheNorm in sensitivity (0.902), NPV (0.697), and F1 (0.920); its PPV was slightly lower than PheNorm's, and its specificity (0.791) needs improvement. General-purpose embedding models and the zero-shot prompt yielded better sensitivity, NPV, and F1-scores, while domain-specific models and the few-shot prompt excelled in specificity and PPV. Optimization enabled lower-ranked embedding models to achieve performance comparable to the highest-ranked ones. Gte-Qwen2-1.5B-instruct and GatorTronS provided the highest token efficiency on specific metrics. Error analysis revealed contextual misinterpretation and retrieval-ranking issues.
Conclusion
GPT-4o using optimized RAG showed superior T2DM phenotyping performance on key metrics. This study provides practical guidance for using RAG, while identifying limitations in LLM reasoning errors and retrieval ranking.