Augmenting Large Language Models and Retrieval-Augmented Generation with an Evidence-Based Medicine-Enabled Agent System
Abstract
Importance
Large language models (LLMs) with retrieval-augmented generation (RAG) show promise for clinical decision support. However, their application is constrained by limited and outdated vector databases, suboptimal evidence retrieval, and poor contextual continuity.
Objective
To develop and evaluate a novel LLM-based agent that integrates Evidence-Based Medicine (EBM) principles and contextual conversation capabilities in answering clinical questions.
Design, Setting, and Participants
The agent for clinical decision making was developed and evaluated between July 1, 2024, and July 31, 2025. The system incorporated an EBM-enabled workflow, a memory module, and Thought-Action-Observation (TAO) loops. Evaluation 1 assessed the system’s performance on 150 initial clinical questions across 15 cancer types. Evaluation 2 involved 45 multi-turn dialogue tasks (across 3 types). Baselines were a state-of-the-art traditional RAG method and commercial LLMs with plugins. All generated responses across both evaluations were independently rated by 3 experts, each with over 5 years of clinical experience. The study was performed at West China Medical Center.
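To make the architecture concrete, the sketch below gives a minimal, hypothetical Python rendering of a Thought-Action-Observation loop paired with a memory module. It illustrates the general agent pattern only; all names and interfaces (`Memory`, `think`, `act`, `observe`, `answer`) are assumptions for illustration, not EBMChat's implementation.

```python
# Illustrative sketch only: a minimal Thought-Action-Observation (TAO) loop
# with a conversation memory, in the spirit of the architecture described
# above. All names and interfaces here are hypothetical, not EBMChat's API.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Stores prior turns so later questions can reference earlier context."""
    turns: list = field(default_factory=list)

    def recall(self) -> str:
        return "\n".join(self.turns)

    def store(self, turn: str) -> None:
        self.turns.append(turn)


def think(question: str, context: str) -> str:
    """Placeholder for the LLM proposing the next step (the 'Thought')."""
    return f"plan retrieval for: {question} | context: {context[:80]}"


def act(thought: str) -> str:
    """Placeholder for an action, e.g. querying an evidence source."""
    return f"retrieved evidence for [{thought}]"


def observe(action_result: str) -> str:
    """Placeholder for interpreting the action's result (the 'Observation')."""
    return f"observation: {action_result}"


def answer(question: str, memory: Memory, max_loops: int = 3) -> str:
    """Run TAO loops until an answer is produced, then persist the turn."""
    context = memory.recall()
    observation = ""
    for _ in range(max_loops):
        thought = think(question, context + observation)
        observation = observe(act(thought))
        # A real agent would let the LLM decide when the gathered evidence
        # suffices; this sketch stops after one pass for brevity.
        break
    reply = f"answer grounded in: {observation}"
    memory.store(f"Q: {question}\nA: {reply}")
    return reply


if __name__ == "__main__":
    mem = Memory()
    print(answer("First-line therapy for stage III NSCLC?", mem))
    print(answer("And for elderly patients?", mem))  # relies on stored context
```

In this pattern the memory module is what lets a follow-up question such as "And for elderly patients?" be interpreted against the preceding turn, which is the contextual-continuity property evaluated in the multi-turn dialogue tasks.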
Main Outcomes and Measures
Each response in evaluation 1 was classified into one of three predefined categories: correct, inaccurate, or wrong. In evaluation 2, a task was deemed successful when the previous conversation was remembered and the answer was correct; otherwise, the task was considered unsuccessful.
Results
In evaluation 1, EBMChat generated the highest proportion of accurate responses (89% vs 78% for the best baseline method). Its superior performance was associated with its ability to retrieve optimal evidence, demonstrated by a significantly higher evidence hierarchy (100% vs 17.5% of retrieved evidence at RCT level or above), stricter evidence timeliness (within 5 years vs from the 1980s onward), and more comprehensive retrieval (median of 693 vs 267 items per question). In evaluation 2, EBMChat successfully completed 93% of the tasks, whereas GPT-4.1 with plugins (Web Search) achieved a success rate of only 31%. This performance gap was attributed to the EBM-enabled workflow, memory module, and TAO loops, which together ensure robust contextual conversation capabilities.
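As a rough illustration of the evidence-selection behavior reported above (admitting only RCT-level or higher evidence from the last 5 years, then ranking by hierarchy and recency), one could filter retrieved items as in the hypothetical sketch below. The hierarchy ordering, thresholds, and field names are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of EBM-aware evidence filtering: keep only items at
# RCT level or above and published within the last 5 years, then rank by
# hierarchy and recency. Field names and the hierarchy map are assumptions.
from datetime import date
from typing import Optional

# Lower rank = stronger evidence (a common EBM pyramid ordering).
HIERARCHY = {
    "systematic_review": 0,
    "meta_analysis": 0,
    "rct": 1,
    "cohort": 2,
    "case_control": 3,
    "case_report": 4,
    "expert_opinion": 5,
}

RCT_OR_ABOVE = 1   # admit only ranks <= this threshold
MAX_AGE_YEARS = 5  # admit only evidence from the last 5 years


def select_evidence(items: list, today: Optional[date] = None) -> list:
    """Filter by hierarchy and timeliness, then sort strongest and newest first."""
    today = today or date.today()
    cutoff_year = today.year - MAX_AGE_YEARS
    eligible = [
        it for it in items
        if HIERARCHY.get(it["study_type"], 99) <= RCT_OR_ABOVE
        and it["year"] >= cutoff_year
    ]
    return sorted(eligible, key=lambda it: (HIERARCHY[it["study_type"]], -it["year"]))


if __name__ == "__main__":
    corpus = [
        {"title": "Old RCT", "study_type": "rct", "year": 1995},
        {"title": "Recent meta-analysis", "study_type": "meta_analysis", "year": 2024},
        {"title": "Recent cohort", "study_type": "cohort", "year": 2023},
        {"title": "Recent RCT", "study_type": "rct", "year": 2022},
    ]
    for item in select_evidence(corpus):
        print(item["title"])  # -> Recent meta-analysis, then Recent RCT
```

Under this kind of policy, a 1995 RCT is excluded on timeliness and a recent cohort study on hierarchy, which mirrors the contrast reported above between EBMChat's retrieval (all RCT-level or above, within 5 years) and the baselines' (17.5% RCT-level or above, reaching back to the 1980s).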
Conclusion and Relevance
EBMChat identifies appropriate evidence by effectively balancing timeliness, hierarchy, and relevance. Meanwhile, its enhanced conversational capabilities preserve contextual information, enabling users to explore clinical problems more deeply and comprehensively in multi-turn dialogues. Our findings underscore that for AI to effectively support clinical practice, core medical principles must be integrated deeply into the technology itself, rather than general-purpose AI tools being applied directly.