Agentic memory-augmented retrieval and evidence grounding for medical question-answering tasks
Abstract
Objective: To evaluate whether a tool-using, agent-based system built on large language models (LLMs) outperforms standalone LLMs on medical question-answering (QA) tasks.

Methods: We developed a unified, open-source LLM-based agentic system that integrates document retrieval, re-ranking, evidence grounding, and diagnosis generation to support dynamic, multi-step medical reasoning. The system features a lightweight retrieval-augmented generation pipeline coupled with a cache-and-prune memory bank, enabling efficient long-context inference beyond standard LLM context limits. It autonomously invokes specialized tools, eliminating the need for manual prompt engineering or brittle multi-stage templates. We compared the agentic system against standalone LLMs on five medical QA benchmarks.

Results: Across the five benchmarks, our system outperforms or closely matches state-of-the-art proprietary and open-source medical LLMs in both multiple-choice and open-ended formats. Specifically, it achieved accuracies of 82.98% on USMLE Step 1 and 86.24% on USMLE Step 2, surpassing GPT-4's 80.67% and 81.67%, respectively, while closely matching GPT-4 on USMLE Step 3 (88.52% vs. 89.78%).

Conclusion: Our findings highlight the value of combining tool-augmented and evidence-grounded reasoning strategies to build reliable and scalable medical AI systems.
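The abstract does not specify how the cache-and-prune memory bank operates; the minimal sketch below illustrates one plausible reading, in which retrieved evidence is cached alongside a re-ranker relevance score and the lowest-scoring entries are evicted whenever a fixed token budget is exceeded. The class names, the relevance-based eviction policy, and the token budget are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    text: str        # a retrieved evidence passage
    relevance: float # re-ranker score (assumed scoring scheme)
    tokens: int      # token count of the passage

@dataclass
class CacheAndPruneMemory:
    """Hypothetical cache-and-prune memory bank: cache retrieved evidence,
    then prune the least relevant entries to stay within a token budget."""
    budget_tokens: int = 4096
    entries: list = field(default_factory=list)

    def cache(self, text: str, relevance: float, tokens: int) -> None:
        self.entries.append(MemoryEntry(text, relevance, tokens))
        self._prune()

    def _prune(self) -> None:
        # Keep the highest-relevance evidence that fits the budget,
        # dropping the least relevant passages first.
        self.entries.sort(key=lambda e: e.relevance, reverse=True)
        kept, total = [], 0
        for entry in self.entries:
            if total + entry.tokens <= self.budget_tokens:
                kept.append(entry)
                total += entry.tokens
        self.entries = kept

    def context(self) -> str:
        # Concatenated evidence handed to the LLM at each reasoning step.
        return "\n\n".join(entry.text for entry in self.entries)
```

Under this reading, the bank keeps each reasoning step's prompt within the model's context window while retaining the most relevant evidence across steps, which is one way a system could support long-context inference beyond standard LLM limits.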