Agentic memory-augmented retrieval and evidence grounding for medical question-answering tasks


Abstract

Objective: To evaluate whether a tool-using, agent-based system built on large language models (LLMs) outperforms standalone LLMs on medical question-answering (QA) tasks. Methods: We developed a unified, open-source LLM-based agentic system that integrates document retrieval, re-ranking, evidence grounding, and diagnosis generation to support dynamic, multi-step medical reasoning. The system features a lightweight retrieval-augmented generation pipeline coupled with a cache-and-prune memory bank, enabling efficient long-context inference beyond standard LLM limits. It autonomously invokes specialized tools, eliminating the need for manual prompt engineering or brittle multi-stage templates. We compared the agentic system against standalone LLMs on several medical QA benchmarks. Results: Evaluated on five well-known medical QA benchmarks, our system outperforms or closely matches state-of-the-art proprietary and open-source medical LLMs in both multiple-choice and open-ended formats. Specifically, it achieved accuracies of 82.98% on USMLE Step 1 and 86.24% on USMLE Step 2, surpassing GPT-4's 80.67% and 81.67%, respectively, while closely matching it on USMLE Step 3 (88.52% vs. 89.78%). Conclusion: Our findings highlight the value of combining tool-augmented and evidence-grounded reasoning strategies to build reliable and scalable medical AI systems.
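To make the described architecture concrete, the sketch below outlines one way a cache-and-prune memory bank could drive a retrieve, re-rank, ground, and generate loop. It is a minimal illustration only: the names (`MemoryBank`, `retrieve`, `rerank`, `generate`), the pruning rule, and the fixed step count are assumptions for exposition, not the paper's actual implementation or API.

```python
# Illustrative sketch (not the paper's code): a cache-and-prune memory bank
# feeding an agentic retrieve -> re-rank -> ground -> answer loop.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class MemoryBank:
    """Caches retrieved evidence and prunes low-scoring items to bound context size."""
    max_items: int = 8
    items: List[Tuple[float, str]] = field(default_factory=list)  # (score, passage)

    def add(self, score: float, text: str) -> None:
        self.items.append((score, text))
        # Prune: keep only the highest-scoring evidence within the budget,
        # so the accumulated context never exceeds the model's window.
        self.items.sort(key=lambda pair: pair[0], reverse=True)
        del self.items[self.max_items:]

    def context(self) -> str:
        return "\n".join(text for _, text in self.items)


def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],                    # tool: document retrieval
    rerank: Callable[[str, List[str]], List[Tuple[float, str]]],  # tool: re-ranking
    generate: Callable[[str, str], str],                     # LLM: grounded generation
    steps: int = 3,
) -> str:
    """Multi-step loop: retrieve documents, re-rank them, cache the best
    evidence in the memory bank, then generate an evidence-grounded answer."""
    memory = MemoryBank()
    for _ in range(steps):
        docs = retrieve(question)
        for score, doc in rerank(question, docs):
            memory.add(score, doc)
    # Final answer is conditioned only on the pruned, cached evidence.
    return generate(question, memory.context())
```

In this sketch the memory bank, rather than the raw retrieval output, defines the context passed to the generator, which is one way to keep multi-step evidence accumulation within a standard context window.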
