Evaluating Large Language Model Diagnostic Performance on JAMA Clinical Challenges via a Multi-Agent Conversational Framework
Abstract
Background & Objective
Standard clinical LLM benchmarks use multiple-choice vignettes that present all information up front, unlike real encounters in which clinicians iteratively elicit histories and objective data. We hypothesized that such formats inflate LLM performance and mask weaknesses in diagnostic reasoning. We developed a multi-agent conversational framework that converts JAMA Clinical Challenge cases into multi-turn dialogues and evaluated its impact on diagnostic accuracy across frontier LLMs.
Methods
We adapted 815 diagnostic cases from 1,519 JAMA Clinical Challenges into two formats: (1) the original vignette and (2) a multi-agent conversation with a Patient AI (subjective history) and a System AI (objective data: examination, laboratory, and imaging findings). A Clinical LLM queried these agents and produced a final diagnosis. Models tested were O1 (OpenAI), GPT-4o (OpenAI), LLaMA-3-70B (Meta), and Deepseek-R1-distill-LLaMA3-70B (Deepseek), each in multiple-choice and free-response modes. Free-response answers were graded for diagnostic equivalence by a separate GPT-4o judge. Accuracy (with Wilson 95% CIs) and conversation lengths were compared using two-tailed tests.
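To make the conversational protocol concrete, the sketch below shows one way the interaction loop could be structured: the Clinical LLM addresses requests to the Patient AI or the System AI and terminates with a final diagnosis. The agent interface (`QueryFn`), routing prefixes, and turn budget are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of the multi-agent conversation loop described in Methods.
# The callable interface and message routing are assumptions for illustration.
from typing import Callable, Dict, List

Message = Dict[str, str]                   # {"role": ..., "content": ...}
QueryFn = Callable[[List[Message]], str]   # wraps any chat-completion backend


def run_case(clinical_llm: QueryFn,
             patient_ai: QueryFn,
             system_ai: QueryFn,
             case_intro: str,
             max_turns: int = 20) -> str:
    """Let the Clinical LLM iteratively gather subjective history (Patient AI)
    and objective data (System AI), then return its final diagnosis."""
    transcript: List[Message] = [{
        "role": "user",
        "content": (case_intro +
                    "\nAsk the PATIENT for history or the SYSTEM for exam/labs/imaging. "
                    "Prefix each request with 'PATIENT:' or 'SYSTEM:'. "
                    "When ready, reply 'FINAL DIAGNOSIS: <diagnosis>'.")
    }]
    for _ in range(max_turns):
        reply = clinical_llm(transcript)
        transcript.append({"role": "assistant", "content": reply})
        if reply.strip().upper().startswith("FINAL DIAGNOSIS"):
            return reply.split(":", 1)[-1].strip()
        # Route the request to the appropriate simulated agent.
        responder = patient_ai if reply.strip().upper().startswith("PATIENT") else system_ai
        transcript.append({"role": "user", "content": responder(transcript)})
    # Force an answer if the turn budget is exhausted.
    transcript.append({"role": "user", "content": "Give your FINAL DIAGNOSIS now."})
    return clinical_llm(transcript)
```

In this sketch, conversation length can be measured as the number of loop iterations before the model commits to a diagnosis, which is the quantity compared across models in the Results.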
Results
Accuracy decreased for all models when moving from vignettes to conversations and from multiple-choice to free-response (p<0.0001 for all pairwise comparisons). In vignette multiple-choice, accuracy was O1 79.8% (95% CI, 76.9%–82.4%), GPT-4o 74.5% (71.4%–77.4%), LLaMA-3 70.9% (69.5%–72.2%), and Deepseek-R1 69.0% (67.5%–70.4%). In conversation multiple-choice, accuracy was O1 69.1% (65.8%–72.2%), GPT-4o 51.3% (49.8%–52.8%), LLaMA-3 49.7% (48.2%–51.3%), and Deepseek-R1 34.0% (32.6%–35.5%). In conversation free-response, accuracy was O1 31.7% (28.6%–34.9%), GPT-4o 20.7% (19.5%–22.0%), LLaMA-3 22.9% (21.6%–24.2%), and Deepseek-R1 9.3% (8.4%–10.2%). O1 generally required fewer conversational turns than GPT-4o, suggesting more efficient multi-turn reasoning.
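As a sanity check on the reported intervals, the Wilson score CI can be reproduced in a few lines. The success count below (650 of 815) is our back-calculation from the reported 79.8% vignette multiple-choice accuracy for O1, not a figure stated in the paper.

```python
# Wilson score interval, a minimal sketch; 650/815 is back-calculated from the
# reported 79.8% accuracy and is an assumption, not a reported count.
from math import sqrt


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


lo, hi = wilson_ci(650, 815)
print(f"{lo:.1%}-{hi:.1%}")   # ≈ 76.9%-82.4%, matching the reported interval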
Conclusions
Converting vignettes into multi-agent, multi-turn dialogues reveals substantial performance drops across leading LLMs, indicating that static multiple-choice benchmarks overestimate clinical reasoning competence. Our open-source framework offers a more rigorous and discriminative evaluation and a realistic substrate for educational use, enabling assessment of iterative information-gathering and synthesis that better reflects clinical practice.