Real-World Validation of MedSearch: A Conversational Agent for Real-Time, Evidence-Based Medical Question-Answering
Abstract
Introduction
The application of Large Language Model (LLM)-powered Conversational Agents (CAs) in healthcare has been evaluated using medical question-answering (QA) datasets, with excellent performance on international medical licensing exams [1, 2, 3]. However, multiple-choice questions fall short when the goal is to assess more complex language interactions and open-ended responses.
Objective
To evaluate the time invested and the validity of health care personnel's (HCP) responses to clinical questions using MedSearch compared with traditional, non-AI search methods.
Methods
This was a randomized, double-blind trial with 100 participants assigned to two groups. Each group answered four clinical cases with four questions each; one group used MedSearch, while the other used traditional search methods such as Google and PubMed, excluding any AI tools. Field specialists evaluated the responses on six aspects established to define answer validity. Time to respond was also recorded, described, and compared between the two groups.
Results
More than 70% of the sample were medical students. Differences between groups were statistically significant in all evaluated aspects (p < 0.01): the intervention (MedSearch) group arrived at a final answer in half the time of the control group (traditional search methods), roughly three minutes faster, with approximately 66% fewer searches per case. The model's answers were valid (accurate, current, aligned with consensus, and safe), with an average score of 2.8 on a scale of 1 to 3. Most MedSearch users found it useful for daily practice and would recommend it to colleagues.
Conclusion
The present results suggest a positive impact of LLM-supported methods on the effectiveness of clinical search, without sacrificing, and in some aspects even improving, the quality of answers. Further clinical validation is needed to better understand the effect of LLM use in education and clinical practice, using larger samples and professionals from different fields.