Augmenting Large Language Models and Retrieval-Augmented Generation with an Evidence-Based Medicine-Enabled Agent System
Abstract
Importance
Large language models (LLMs) with retrieval-augmented generation (RAG) show promise for clinical decision support. However, their application is constrained by limited and outdated vector databases, suboptimal evidence retrieval, and poor contextual continuity.
Objective
To develop and evaluate a novel LLM-based agent that integrates Evidence-Based Medicine (EBM) principles and contextual conversation capabilities in answering clinical questions.
Design, Setting, and Participants
The agent for clinical decision making was developed and evaluated between July 1, 2024, and July 31, 2025. The system incorporated an EBM-enabled workflow, a memory module, and Thought-Action-Observation (TAO) loops. Evaluation 1 assessed the system’s performance on 150 initial clinical questions across 15 cancer types. Evaluation 2 involved 45 multi-turn dialogue tasks (across 3 types). Baselines were a state-of-the-art traditional RAG method and commercial LLMs with plugins. All generated responses across both evaluations were independently rated by 3 experts, each with over 5 years of clinical experience. The study was performed at West China Medical Center.
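To make the architecture concrete, the sketch below gives a minimal, hypothetical Python rendering of a Thought-Action-Observation loop paired with a memory module. It illustrates the general agent pattern only; all names and interfaces (`Memory`, `think`, `act`, `observe`, `answer`) are assumptions for illustration, not EBMChat's implementation.

```python
# Illustrative sketch only: a minimal Thought-Action-Observation (TAO) loop
# with a conversation memory, in the spirit of the architecture described
# above. All names and interfaces here are hypothetical, not EBMChat's API.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Stores prior turns so later questions can reference earlier context."""
    turns: list = field(default_factory=list)

    def recall(self) -> str:
        return "\n".join(self.turns)

    def store(self, turn: str) -> None:
        self.turns.append(turn)


def think(question: str, context: str) -> str:
    """Placeholder for the LLM proposing the next step (the 'Thought')."""
    return f"plan retrieval for: {question} | context: {context[:80]}"


def act(thought: str) -> str:
    """Placeholder for an action, e.g. querying an evidence source."""
    return f"retrieved evidence for [{thought}]"


def observe(action_result: str) -> str:
    """Placeholder for interpreting the action's result (the 'Observation')."""
    return f"observation: {action_result}"


def answer(question: str, memory: Memory, max_loops: int = 3) -> str:
    """Run TAO loops until an answer is produced, then persist the turn."""
    context = memory.recall()
    observation = ""
    for _ in range(max_loops):
        thought = think(question, context + observation)
        observation = observe(act(thought))
        # A real agent would let the LLM decide when the gathered evidence
        # suffices; this sketch stops after one pass for brevity.
        break
    reply = f"answer grounded in: {observation}"
    memory.store(f"Q: {question}\nA: {reply}")
    return reply


if __name__ == "__main__":
    mem = Memory()
    print(answer("First-line therapy for stage III NSCLC?", mem))
    print(answer("And for elderly patients?", mem))  # relies on stored context
```

In this pattern the memory module is what lets a follow-up question such as "And for elderly patients?" be interpreted against the preceding turn, which is the contextual-continuity property evaluated in the multi-turn dialogue tasks.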
Main Outcomes and Measures
Each response in evaluation 1 was classified into one of three predefined categories: correct, inaccurate, or wrong. In evaluation 2, a task was deemed successful when the previous conversation was remembered and the answer was correct; otherwise, the task was considered unsuccessful.
Results
In evaluation 1, EBMChat generated the highest proportion of accurate responses (89% vs 78% for the best baseline method). Its superior performance was associated with its ability to retrieve optimal evidence, demonstrated by a significantly higher evidence hierarchy (100% vs 17.5% of retrieved evidence at RCT level or above), stricter evidence timeliness (within 5 years vs from the 1980s onward), and more comprehensive retrieval (median of 693 vs 267 items per question). In evaluation 2, EBMChat successfully completed 93% of the tasks, whereas GPT-4.1 with plugins (Web Search) achieved a success rate of only 31%. This performance gap was attributed to the EBM-enabled workflow, memory module, and TAO loops, which together ensure robust contextual conversation capabilities.
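As a rough illustration of the evidence-selection behavior reported above (admitting only RCT-level or higher evidence from the last 5 years, then ranking by hierarchy and recency), one could filter retrieved items as in the hypothetical sketch below. The hierarchy ordering, thresholds, and field names are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of EBM-aware evidence filtering: keep only items at
# RCT level or above and published within the last 5 years, then rank by
# hierarchy and recency. Field names and the hierarchy map are assumptions.
from datetime import date
from typing import Optional

# Lower rank = stronger evidence (a common EBM pyramid ordering).
HIERARCHY = {
    "systematic_review": 0,
    "meta_analysis": 0,
    "rct": 1,
    "cohort": 2,
    "case_control": 3,
    "case_report": 4,
    "expert_opinion": 5,
}

RCT_OR_ABOVE = 1   # admit only ranks <= this threshold
MAX_AGE_YEARS = 5  # admit only evidence from the last 5 years


def select_evidence(items: list, today: Optional[date] = None) -> list:
    """Filter by hierarchy and timeliness, then sort strongest and newest first."""
    today = today or date.today()
    cutoff_year = today.year - MAX_AGE_YEARS
    eligible = [
        it for it in items
        if HIERARCHY.get(it["study_type"], 99) <= RCT_OR_ABOVE
        and it["year"] >= cutoff_year
    ]
    return sorted(eligible, key=lambda it: (HIERARCHY[it["study_type"]], -it["year"]))


if __name__ == "__main__":
    corpus = [
        {"title": "Old RCT", "study_type": "rct", "year": 1995},
        {"title": "Recent meta-analysis", "study_type": "meta_analysis", "year": 2024},
        {"title": "Recent cohort", "study_type": "cohort", "year": 2023},
        {"title": "Recent RCT", "study_type": "rct", "year": 2022},
    ]
    for item in select_evidence(corpus):
        print(item["title"])  # -> Recent meta-analysis, then Recent RCT
```

Under this kind of policy, a 1995 RCT is excluded on timeliness and a recent cohort study on hierarchy, which mirrors the contrast reported above between EBMChat's retrieval (all RCT-level or above, within 5 years) and the baselines' (17.5% RCT-level or above, reaching back to the 1980s).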
Conclusion and Relevance
EBMChat identifies appropriate evidence by effectively balancing timeliness, hierarchy, and relevance. Meanwhile, its enhanced conversational capabilities preserve contextual information, enabling users to explore clinical problems more deeply and comprehensively in multi-turn dialogues. Our findings underscore that for AI to effectively support clinical practice, core medical principles must be integrated deeply into the technology itself, rather than general-purpose AI tools being applied directly.