Development and validation of Retrieval Augmented Generation (RAG) and GraphRAG for complex clinical cases
Abstract
Objective
Chronic Kidney Disease (CKD) is a progressive condition requiring evidence-based management, but adherence to complex guidelines remains challenging. Large Language Models (LLMs) could support clinical decision-making, yet their unreliability limits direct use. This study aimed to evaluate whether Retrieval-Augmented Generation (RAG), particularly a knowledge graph-enhanced pipeline (GraphRAG), improves guideline-based clinical decision support (CDS) in CKD management.
Methods and Analysis
We compared three approaches: a baseline LLM (GPT-4o), a vector-indexed RAG pipeline, and a GraphRAG pipeline. Each approach answered nine clinically relevant questions for a synthetic cohort of 70 CKD patients. Outputs were assessed for clinical correctness, patient-specificity, and clarity, using both clinician-led evaluations and an LLM-as-Judge framework.
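To illustrate the general idea behind a vector-indexed RAG step, the following is a minimal sketch, not the study's pipeline. It assumes a toy bag-of-words cosine similarity in place of learned embeddings, and the function names (`retrieve`, `build_prompt`) and the placeholder passages are hypothetical. The retrieved passages are placed in the prompt alongside the patient summary before the question is sent to the LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return up to k passages ranked by similarity to the query.

    Assumption: bag-of-words vectors stand in for a real embedding model.
    Passages with zero similarity are dropped.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), i) for i, p in enumerate(passages)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [passages[i] for score, i in scored[:k] if score > 0]


def build_prompt(question, patient_summary, context_passages):
    """Assemble a prompt that grounds the question in retrieved text."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return (
        f"Guideline excerpts:\n{context}\n\n"
        f"Patient: {patient_summary}\n\n"
        f"Question: {question}"
    )


# Hypothetical placeholder passages, not real guideline text.
passages = [
    "placeholder text on blood pressure targets",
    "placeholder text on dietary advice",
    "placeholder text on referral criteria",
]
top = retrieve("what blood pressure targets apply", passages, k=1)
prompt = build_prompt("What blood pressure targets apply?", "synthetic patient record", top)
```

The baseline condition corresponds to calling the model with the question and patient summary only, with no retrieved excerpts in the prompt.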
Results
RAG-based methods outperformed the baseline LLM in clinical correctness and guideline adherence. GraphRAG achieved the highest patient-specificity by leveraging multi-hop relationships across a knowledge graph derived from NICE CKD guidelines, particularly for tasks involving thresholds, algorithmic decisions, or open-ended management. However, GraphRAG scored lower in clarity, as its graph walks often returned long guideline excerpts that obscured key recommendations. All RAG systems were limited by the scope of the indexed guideline and performed poorly when essential information was missing.
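The multi-hop retrieval described above can be pictured with a small sketch. This is an illustrative assumption about how a graph walk might work, not the authors' implementation: the graph is a plain adjacency dictionary, the node and relation names are hypothetical placeholders, and `multi_hop` is a breadth-first walk that collects every path of up to `max_hops` edges from a starting node.

```python
from collections import deque


def multi_hop(graph, start, max_hops):
    """Collect all edge paths of length 1..max_hops that begin at `start`.

    `graph` maps a node to a list of (relation, target) pairs.
    Each path is a list of (source, relation, target) triples.
    Nodes already on the current path are skipped to avoid cycles.
    """
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue
        for relation, target in graph.get(node, []):
            visited = {start} | {step[2] for step in path}
            if target in visited:
                continue
            new_path = path + [(node, relation, target)]
            paths.append(new_path)
            queue.append((target, new_path))
    return paths


# Hypothetical toy graph: a patient finding linked to a guideline
# section, which in turn links to a recommendation.
graph = {
    "finding_A": [("covered_by", "section_X")],
    "section_X": [("recommends", "recommendation_Y")],
    "recommendation_Y": [("applies_to", "finding_A")],
}
one_hop = multi_hop(graph, "finding_A", 1)
two_hop = multi_hop(graph, "finding_A", 2)
```

In this sketch, raising `max_hops` returns more and longer paths, which is one way to picture the trade-off reported here: more connected context per patient, but also more text to condense into a clear recommendation.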
Conclusions
RAG and GraphRAG provide a scalable, auditable foundation for guideline-aligned CDS in CKD, with GraphRAG showing particular strengths in tailoring advice to patient data. Nonetheless, trade-offs remain between specificity and clarity, and effective deployment will require robust content management, transparent validation pipelines, and integration within established clinical governance frameworks.
Key points
- LLMs have comprehensive medical knowledge but require access to up-to-date, evidence-based, and locally relevant guidelines to be effective in CDS.
- Hallucinations (the generation of inaccurate or misleading information) remain a major limitation for LLMs in healthcare.
- Traditional information retrieval methods face several challenges in providing accurate, context-specific evidence.
- Retrieval-Augmented Generation (RAG) and graph-based RAG approaches have emerged as promising solutions to overcome these limitations.
- Renal medicine provides an ideal test domain to evaluate these models, given its complexity and reliance on nuanced, multidisciplinary decision-making.
- Studying LLM performance in kidney health can yield valuable insights into how such models can safely and effectively support complex clinical decision-making.