AI vs Human Performance in Conversational Hospital-Based Neurological Diagnosis

Abstract

Background

Most evaluations of artificial intelligence (AI) in medicine rely on static, multiple-choice benchmarks that fail to capture the dynamic, sequential nature of clinical diagnosis. While conversational AI has shown promise in telemedicine, evaluations of such systems rarely examine the iterative decision-making process in which clinicians gather information, order tests, and refine diagnoses.

Methods

We developed DiagnosticXchange, a web-based platform simulating realistic clinical interactions between providers and specialist consultants. A ‘nurse’ agent responds to requests from human physicians or AI systems acting as diagnosticians. Sixteen neurological diagnostic challenges of varying complexity were drawn from diverse educational and peer-reviewed sources. We evaluated 14 neurologists at different stages of training and multiple state-of-the-art large language models (LLMs) on three metrics: diagnostic accuracy, procedural cost efficiency (based on CPT codes and hospital pricing), and time to diagnosis (based on actual procedure durations). We also developed Gregory, a specialized multi-agent system that systematically generates differential diagnoses, challenges initial hypotheses, and strategically selects high-yield diagnostic tests.
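
The abstract does not specify implementation details, but the interaction it describes can be pictured roughly as follows. This is a minimal illustrative sketch in Python, not the authors' code: the case structure, the `order_test` / `final_diagnosis` actions, the `diagnostician.act` interface, and the cost/time bookkeeping are all assumptions made for clarity.

```python
# Illustrative sketch of a DiagnosticXchange-style session loop.
# All names, actions, and data fields here are assumptions for illustration;
# the abstract does not describe the platform's actual implementation.

from dataclasses import dataclass, field

@dataclass
class Case:
    """One diagnostic challenge: a presenting vignette plus hidden test results."""
    vignette: str
    test_results: dict[str, str]      # e.g. {"MRI brain": "...", "Lumbar puncture": "..."}
    test_costs: dict[str, float]      # hypothetical CPT-based prices, USD
    test_durations: dict[str, int]    # hypothetical turnaround times, days
    answer: str                       # reference diagnosis for scoring

@dataclass
class SessionLog:
    cost: float = 0.0
    days: int = 0
    transcript: list[str] = field(default_factory=list)

def run_session(diagnostician, case: Case, max_turns: int = 20) -> tuple[str, SessionLog]:
    """Let a diagnostician (human interface or LLM wrapper) query a 'nurse'
    agent until it commits to a diagnosis or runs out of turns."""
    log = SessionLog()
    context = case.vignette
    for _ in range(max_turns):
        action, payload = diagnostician.act(context)   # assumed interface
        if action == "final_diagnosis":
            return payload, log
        if action == "order_test":
            result = case.test_results.get(payload, "Test unavailable or unremarkable.")
            log.cost += case.test_costs.get(payload, 0.0)
            log.days += case.test_durations.get(payload, 0)
            context += f"\n[{payload}] {result}"
            log.transcript.append(f"ordered {payload}: {result}")
    return "no diagnosis", log
```

Under this sketch, accuracy would be scored against `case.answer`, while the accumulated `cost` and `days` correspond to the procedural cost and time-to-diagnosis metrics described above.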

Results

Human neurologists achieved 81% diagnostic accuracy (79% for residents, 88% for specialists) across 97 sessions; base LLMs ranged from 81% to 94%. Gregory achieved perfect diagnostic accuracy with markedly lower diagnostic costs (average $1,423; 95% CI: $450-$2,860) compared with human neurologists (average $3,041; 95% CI: $2,464-$3,677; p=0.008) and base LLMs (average $2,759; 95% CI: $2,137-$3,476; p=0.002). Time to diagnosis was also shorter with Gregory (23 days; 95% CI: 6-48) versus human neurologists (43 days; 95% CI: 31-58; p=0.002) and base models (41 days; 95% CI: 31-51; p=0.07). The platform revealed distinct diagnostic patterns: human users and some base LLMs frequently ordered broad and expensive testing, while Gregory employed targeted strategies that avoided unnecessary procedures without sacrificing thoroughness.

Conclusions

A well-designed multi-agent AI system outperformed both human physicians and base LLMs in diagnostic accuracy, while reducing costs and time. DiagnosticXchange enables systematic evaluation of diagnostic efficiency and reasoning in realistic, interactive scenarios, offering a clinically relevant alternative to static benchmarks and a pathway toward more effective AI-assisted diagnosis.
