Augmenting Large Language Models and Retrieval-Augmented Generation with an Evidence-Based Medicine-Enabled Agent System

Abstract

Importance

Large language models (LLMs) with retrieval-augmented generation (RAG) show promise for clinical decision support. However, their application is constrained by limited and outdated vector databases, suboptimal evidence retrieval, and poor contextual continuity.

Objective

To develop and evaluate a novel LLM-based agent that integrates Evidence-Based Medicine (EBM) principles and contextual conversation capabilities in answering clinical questions.

Design, Setting and Participants

The clinical decision-making agent, EBMChat, was developed and evaluated between July 1, 2024, and July 31, 2025. The system incorporated an EBM-enabled workflow, a memory module, and Thought-Action-Observation (TAO) loops. Evaluation 1 assessed the system's performance on 150 initial clinical questions across 15 cancer types. Evaluation 2 involved 45 multi-turn dialogue tasks across 3 task types. Baselines were a state-of-the-art traditional RAG method and commercial LLMs with plugins. All generated responses across both evaluations were independently rated by 3 experts, each with more than 5 years of clinical experience. The study was performed at West China Medical Center.
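The abstract does not describe the implementation, so the following Python sketch is only one plausible reading of how a TAO loop backed by a memory module could be organized. All identifiers (Memory, run_tao, llm, retrieve_evidence) are hypothetical placeholders, not the authors' code, and the stub functions stand in for a real chat-completion client and an EBM-aware retriever.

```python
# Minimal sketch of a Thought-Action-Observation (TAO) loop with a memory
# module, as named in the abstract. Every identifier here is an illustrative
# assumption, not the EBMChat implementation.

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Stores prior turns so multi-turn questions keep their context."""
    turns: list = field(default_factory=list)

    def recall(self) -> str:
        return "\n".join(self.turns)

    def remember(self, entry: str) -> None:
        self.turns.append(entry)


def llm(prompt: str) -> str:
    """Plug in any chat-completion client here; the canned reply keeps the
    sketch runnable without API keys (hypothetical placeholder)."""
    return "FINAL ANSWER: consult the retrieved guideline evidence."


def retrieve_evidence(query: str) -> str:
    """Stand-in for an EBM-aware retriever that would filter hits by
    evidence hierarchy (RCT level or above) and recency (within 5 years)."""
    return "[stub] top-ranked evidence for: " + query


def run_tao(question: str, memory: Memory, max_steps: int = 5) -> str:
    """Iterate Thought -> Action -> Observation until the agent answers."""
    context = memory.recall()
    for _ in range(max_steps):
        # Thought: reason over the accumulated context.
        thought = llm(f"Context:\n{context}\nQuestion: {question}\nThink step by step.")
        if "FINAL ANSWER:" in thought:
            answer = thought.split("FINAL ANSWER:", 1)[1].strip()
            memory.remember(f"Q: {question}\nA: {answer}")  # persist the turn
            return answer
        # Action + Observation: query evidence sources and fold results back in.
        observation = retrieve_evidence(thought)
        context += f"\nThought: {thought}\nObservation: {observation}"
    return "No confident answer within the step budget."


if __name__ == "__main__":
    memory = Memory()
    print(run_tao("First-line therapy for stage III NSCLC?", memory))
```

In a system like the one described, the retriever rather than the loop would carry the EBM logic, enforcing the hierarchy and timeliness filters before any evidence reaches the model, while the memory module is what allows later turns to build on earlier answers.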

Main Outcomes and Measures

Each response in evaluation 1 was classified into one of three predefined categories: correct, inaccurate, or wrong. In evaluation 2, a task was deemed successful when the previous conversation was remembered and the answer was correct; otherwise, the task was considered unsuccessful.

Results

In evaluation 1, EBMChat generated the highest proportion of accurate responses (89% vs 78% for the best baseline method). The superior performance of EBMChat was associated with its ability to retrieve optimal evidence, demonstrated by a significantly higher evidence hierarchy (100% vs 17.5% of retrieved evidence at RCT level or above), stricter evidence timeliness (within 5 years vs from the 1980s onwards), and more comprehensive retrieval (median of 693 vs 267 items per question). In evaluation 2, EBMChat successfully completed 93% of the tasks, whereas GPT-4.1 with plugins (Web Search) achieved a success rate of only 31%. This performance gap was attributed to the EBM-enabled workflow, memory module, and TAO loops, which together ensure robust contextual conversation capabilities.

Conclusions and Relevance

EBMChat identifies appropriate evidence by effectively balancing timeliness, hierarchy, and relevance. Meanwhile, its enhanced conversational capabilities preserve contextual information, enabling users to explore clinical problems more deeply or more comprehensively in multi-turn dialogues. Our findings underscore that effectively advancing clinical practice with AI requires deeper integration of core medical principles into the technology itself, rather than the direct application of general-purpose AI tools.
