Optimizing the Clinical Application of Rheumatology Guidelines Using Large Language Models: A Retrieval-Augmented Generation Framework Integrating EULAR and ACR Recommendations
Abstract
Objectives
Timely access to current rheumatology guidelines at the point of care is challenging. We aimed to develop and evaluate the first Retrieval-Augmented Generation (RAG) system specifically designed for adult rheumatology, integrating European Alliance of Associations for Rheumatology (EULAR) and American College of Rheumatology (ACR) guidelines to provide rheumatologists with evidence-based recommendations when they are needed.
Methods
EULAR and ACR management guidelines were selected by rheumatologists based on their clinical relevance for decision making and then processed. A RAG system was implemented using the LangChain framework, the voyage-3 embedding model, and a Qdrant vector database. To evaluate the system, ten questions per guideline were generated using ChatGPT 4.5. Answers to these guideline-specific questions were then produced by ChatGPT-o3-mini both with context retrieval (RAG) and without it (baseline). Performance was assessed by an LLM-as-a-judge (Gemini 2.0 Flash) using a 5-point Likert scale across five dimensions: relevance, factual accuracy, safety, completeness, and conciseness. The judge also indicated which of the two responses, RAG or baseline, it preferred. Statistical significance was assessed using Wilcoxon signed-rank and binomial tests. For further validation, two blinded rheumatologists independently evaluated a random 15% sample of the questions.
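The contrast between the RAG and baseline conditions can be illustrated with a minimal sketch. It does not reproduce the study's pipeline: instead of LangChain, voyage-3, and Qdrant it uses hand-made three-dimensional vectors and plain cosine similarity, and every name in it (`cosine`, `retrieve`, `build_prompt`, the placeholder chunks) is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]


def build_prompt(question, context_chunks=None):
    """Baseline prompt when no context is given, RAG prompt otherwise."""
    if not context_chunks:
        return f"Question: {question}"
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical toy index: placeholder texts and hand-made vectors,
# standing in for embedded guideline chunks in a vector database.
chunks = [
    {"text": "placeholder chunk A", "vector": [1.0, 0.0, 0.0]},
    {"text": "placeholder chunk B", "vector": [0.9, 0.1, 0.0]},
    {"text": "placeholder chunk C", "vector": [0.0, 1.0, 0.0]},
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded question

top = retrieve(query_vec, chunks, k=2)
rag_prompt = build_prompt("example question", top)   # with retrieved context
baseline_prompt = build_prompt("example question")   # without context
```

In this sketch the only difference between the two conditions is whether retrieved text is prepended to the question, which mirrors the paired design described above.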
Results
After agreement, 74 guidelines were included, and 740 evaluation questions were generated. Analysis revealed that the RAG system significantly outperformed the baseline system across all criteria (p<0.001) in the LLM-as-a-judge evaluation. Manual evaluation by rheumatologists confirmed these findings (p<0.001 for accuracy, safety, completeness). Furthermore, the RAG system was significantly preferred by the LLM-as-a-judge in 92.8% of comparisons (p<0.001) and by the human evaluators in 71.2%-74.8% of comparisons (p<0.001).
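As an illustration of how a preference rate can be tested against chance, the following is a minimal sketch of an exact two-sided binomial test under a null preference probability of 0.5. The function name `binomial_two_sided_p` and the counts (17 of 20) are hypothetical and are not taken from the study, whose raw counts and exact test settings are not given here.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    observed = comb(n, k)
    # Sum the weights of all outcomes that are no more likely than the observed one.
    extreme = sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= observed)
    return extreme / 2 ** n


# Hypothetical example: one system preferred in 17 of 20 paired comparisons.
p_value = binomial_two_sided_p(17, 20)
```

Because the null distribution is symmetric at 0.5, integer binomial coefficients are enough to compare outcome probabilities, which avoids floating-point issues for larger numbers of comparisons.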
Conclusion
This study demonstrates the successful development and evaluation of a RAG system integrating extensive EULAR/ACR guidelines for adult rheumatology. The system significantly improves answer quality compared to a baseline LLM. This provides a robust foundation for reliable, AI-driven clinical decision support tools designed to enhance guideline adherence and evidence-based practice in rheumatology by providing clinicians with rapid, context-aware access to recommendations.
Key messages
- Large language models, combined with EULAR and ACR guidelines, may enhance rheumatology clinical decision support.
- Retrieval augmented generation (RAG) responses showed significantly greater accuracy, safety and completeness than baseline LLMs.
- RAG is a promising architecture for reducing hallucinations and providing grounded, reliable answers.