Optimizing the Clinical Application of Rheumatology Guidelines Using Large Language Models: A Retrieval-Augmented Generation Framework Integrating ACR and EULAR Recommendations
Abstract
Objectives
To develop and evaluate a Retrieval-Augmented Generation (RAG) system integrating European Alliance of Associations for Rheumatology (EULAR) and American College of Rheumatology (ACR) guidelines to provide rheumatologists with timely, evidence-based recommendations at the point of care.
Methods
EULAR and ACR management guidelines were selected by rheumatologists according to their relevance to clinical decision making, then processed and chunked. A RAG system was implemented using the LangChain framework, the voyage-3 embedding model, and a Qdrant vector database. Answers to 740 guideline-specific questions were generated by ChatGPT-o3-mini both with context retrieval (RAG) and without it (baseline). Performance was evaluated using an LLM-as-a-judge (Gemini 2.0 Flash) that assessed factual accuracy, safety, completeness, faithfulness, and preference. Statistical significance was assessed with Wilcoxon signed-rank and binomial tests.
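The retrieval step described above can be illustrated with a minimal, dependency-free sketch. It is not the study's implementation: a toy bag-of-words vector stands in for the voyage-3 embeddings, an in-memory list stands in for the Qdrant collection, LangChain is not used, and the function names, chunk sizes, and passages are hypothetical placeholders rather than guideline text.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: token counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble the context-augmented prompt that would be sent to the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Placeholder passages, NOT real guideline content.
chunks = [
    "placeholder passage about gout flare management",
    "placeholder passage about vaccination in rheumatic disease",
    "placeholder passage about osteoarthritis exercise therapy",
]
question = "How should a gout flare be managed?"
top_chunks = retrieve(question, chunks, k=1)
rag_prompt = build_prompt(question, top_chunks)
baseline_prompt = f"Question: {question}"  # baseline: no retrieved context
```

In this sketch the only difference between the RAG and baseline conditions is whether retrieved context is prepended to the question, which mirrors the comparison described in the Methods.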
Results
After agreement among the rheumatologists, 74 guidelines were included. The RAG-based system received medians that were consistently higher than or comparable to those of the baseline across all criteria: relevance, factual accuracy, safety, completeness and conciseness (p<0.001). Moreover, the LLM judge preferred the RAG-based system in 92.8% of comparisons, a significant preference (p<0.001).
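To make the preference analysis concrete, the following is a minimal sketch of an exact two-sided binomial test against a 50% null (no preference between the two systems). The function name and the counts in the example are hypothetical; the abstract does not report raw preference counts, and the authors' actual statistical software is not stated.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null hypothesis.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical toy counts: the judge prefers system A in 9 of 10 comparisons.
p_value = binomial_two_sided_p(successes=9, trials=10)
```

With many comparisons and a strongly lopsided preference, the p-value from such a test becomes very small, which is consistent in spirit with the reported p<0.001.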
Conclusion
This study demonstrates the successful development and validation of a RAG system integrating extensive ACR/EULAR guidelines. The system significantly improves answer quality compared to a baseline LLM, providing a promising foundation for reliable, AI-driven clinical decision support tools in rheumatology to enhance guideline adherence.
Key messages
- Large language models, combined with EULAR and ACR guidelines, may enhance rheumatology clinical decision support.
- Retrieval augmented generation (RAG) responses showed significantly greater accuracy, safety and completeness than baseline LLMs.
- RAG is a promising architecture for reducing hallucinations and providing grounded, reliable answers.