Optimizing the Clinical Application of Rheumatology Guidelines Using Large Language Models: A Retrieval-Augmented Generation Framework Integrating ACR and EULAR Recommendations

Alfredo Madrid-García
Diego Benavent
Beatriz Merino-Barbancho
Dalifer Freites-Núnez

Abstract

Objectives

To develop and evaluate a Retrieval-Augmented Generation (RAG) system integrating European Alliance of Associations for Rheumatology (EULAR) and American College of Rheumatology (ACR) guidelines to provide rheumatologists with timely, evidence-based recommendations at the point of care.

Methods

EULAR and ACR management guidelines were selected by rheumatologists for their relevance to clinical decision-making, then processed and chunked. A RAG system was implemented using the LangChain framework, the voyage-3 embedding model, and a Qdrant vector database. Answers to 740 guideline-specific questions were generated by ChatGPT-o3-mini both with context retrieval (RAG) and without it (baseline). Performance was evaluated with an LLM-as-a-judge (Gemini 2.0 Flash) that assessed factual accuracy, safety, completeness, faithfulness, and preference; statistical significance was tested with Wilcoxon signed-rank and binomial tests.
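The two answer conditions described in the Methods (with retrieval and without) can be illustrated with a minimal sketch. This is not the authors' implementation: a toy bag-of-words similarity stands in for the voyage-3 embeddings and the Qdrant search, the chunk texts are placeholders rather than guideline content, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical. The model call itself is omitted; only prompt construction is shown.

```python
import math
from collections import Counter

# Placeholder chunks (assumption): stand-ins for processed guideline text.
CHUNKS = [
    "placeholder excerpt on gout flare management",
    "placeholder excerpt on lupus nephritis monitoring",
    "placeholder excerpt on osteoarthritis exercise advice",
]

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, context=None):
    """Baseline prompt when context is None or empty, RAG prompt otherwise."""
    if not context:
        return f"Question: {question}\nAnswer:"
    excerpts = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the guideline excerpts below.\n"
        f"{excerpts}\n\nQuestion: {question}\nAnswer:"
    )

question = "How should a gout flare be handled?"
baseline_prompt = build_prompt(question)
rag_prompt = build_prompt(question, retrieve(question, CHUNKS, k=1))
```

Both prompts would then be sent to the same model, so that any difference in answer quality can be attributed to the retrieved context rather than to the model.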

Results

After agreement, 74 guidelines were included. The RAG-based system received median scores consistently higher than or comparable to the baseline across all criteria: relevance, factual accuracy, safety, completeness, and conciseness (p<0.001). Moreover, the LLM judge preferred the RAG-based system in 92.8% of comparisons (p<0.001).
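To show how a preference rate of this size relates to a binomial test, here is a minimal sketch of an exact one-sided test against the null hypothesis of no preference (p = 0.5). Several things are assumptions: the abstract does not say whether the test was one- or two-sided, the function name `binomial_p_greater` is hypothetical, and the count of 687 out of 740 is back-calculated from the reported 92.8% on the assumption that each of the 740 questions yielded one comparison.

```python
from math import comb

def binomial_p_greater(successes, trials):
    """Exact one-sided p-value P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    # Sum the upper tail with exact integers, then divide once to avoid
    # floating-point underflow in the individual terms.
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return tail / 2 ** trials

# Assumed counts (not stated in the abstract): 687 of 740 is about 92.8%.
p_value = binomial_p_greater(687, 740)
print(p_value < 0.001)
```

Under these assumptions the p-value falls far below 0.001, which is consistent with the significance level reported above.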

Conclusion

This study demonstrates the successful development and validation of a RAG system integrating extensive ACR/EULAR guidelines. The system significantly improves answer quality compared to a baseline LLM, providing a promising foundation for reliable, AI-driven clinical decision support tools in rheumatology to enhance guideline adherence.

Key messages

Large language models, combined with EULAR and ACR guidelines, may enhance rheumatology clinical decision support.
Retrieval-augmented generation (RAG) responses showed significantly greater accuracy, safety, and completeness than those of the baseline LLM.
RAG is a promising architecture for reducing hallucinations and providing grounded, reliable answers.

Version published to 10.1101/2025.04.10.25325588v1 on medRxiv
Apr 11, 2025

Related articles

Retrieval-Augmented-Generation large language models outperform junior clinicians in guideline-concordant PSA testing

This article has 14 authors:
1. Joshua Yi Min Tung
2. Quan Le
3. Jinxuan Yao
4. Yifei Huang
5. Daniel Yan Zheng Lim
6. Gerald Gui Ren Sng
7. Rachel Shu En Lau
8. Yu Guang Tan
9. Kenneth Chen
10. Kae Jack Tay
11. Jen Hong Tan
12. John Shyi-Peng Yuen
13. Christopher Wai Sam Cheng
14. Henry Sun Sien Ho
This article has no evaluations. Latest version: Mar 31, 2025

Bridging AI and Healthcare: A Scoping Review of Retrieval-Augmented Generation—Ethics, Bias, Transparency, Improvements, and Applications

This article has 5 authors:
1. David J. Bunnell
2. Mary J. Bondy
3. Lucy M. Fromtling
4. Emilie Ludeman
5. Krishnaj Gourab
This article has no evaluations. Latest version: Apr 1, 2025

Reproducible Generative AI Evaluation for Healthcare: A Clinician-in-the-Loop Approach

This article has 7 authors:
1. Leah Livingston
2. Amber Featherstone-Uwague
3. Amanda Barry
4. Kenneth Barretto
5. Tara Morey
6. Drahomira Herrmannova
7. Venkatesh Avula
This article has no evaluations. Latest version: Mar 7, 2025
