A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations
Abstract
Background and Objectives
Traditional medical board examinations present clinical information in static vignettes with multiple-choice answers, a format fundamentally different from how physicians gather and integrate data in practice. Recent advances in Large Language Models (LLMs) offer promising approaches to creating more realistic interactive clinical conversations. However, these approaches remain limited in neurosurgery, where patients' capacity to communicate varies significantly and diagnosis relies heavily on objective data such as imaging and neurological examinations. We aimed to develop and evaluate a multi-AI agent conversation framework for neurosurgical case assessment that enables realistic clinical interactions through simulated patients and structured access to objective clinical data.
Methods
We developed a framework to convert 608 Self-Assessment in Neurological Surgery (SANS) first-order diagnosis questions into conversational sessions using three specialized AI agents: a Patient AI providing subjective information, a System AI providing objective data, and a Clinical AI performing diagnostic reasoning. We evaluated GPT-4o's diagnostic accuracy across traditional vignettes, patient-only conversations, and patient-plus-System-AI interactions, with benchmark testing by ten neurosurgery residents.
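To make the three-agent design concrete, below is a minimal sketch of one possible conversation loop, assuming GPT-4o is accessed through the OpenAI chat completions API. It is not the authors' implementation: the agent prompts, the PATIENT:/SYSTEM: routing convention, the turn limit, and the FINAL DIAGNOSIS stop phrase are all illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative role prompts (assumptions, not the published prompts).
AGENT_PROMPTS = {
    "patient": ("You are the patient in this case. Answer only from the subjective "
                "history; do not reveal imaging, labs, or examination findings."),
    "system": ("You are the hospital information system. Return only objective data "
               "from the case: imaging reports, neurological examination, labs."),
    "clinical": ("You are a neurosurgery examinee. Each turn, ask one focused question "
                 "prefixed with 'PATIENT:' or 'SYSTEM:'. When confident, reply with "
                 "'FINAL DIAGNOSIS: <your diagnosis>'."),
}


def ask(agent: str, context: str, message: str) -> str:
    """Send one message to the named agent, grounded in the given context
    (the hidden vignette for patient/system, the running transcript for clinical)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"{AGENT_PROMPTS[agent]}\n\n{context}"},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content


def run_session(case_vignette: str, max_turns: int = 10) -> str:
    """Let the Clinical AI interview the Patient and System agents, then diagnose."""
    transcript = "No information gathered yet."
    for _ in range(max_turns):
        move = ask("clinical", transcript,
                   "Decide your next step: a question for PATIENT, a question for "
                   "SYSTEM, or your final diagnosis.")
        if "FINAL DIAGNOSIS:" in move.upper():
            return move
        target = "system" if move.upper().startswith("SYSTEM") else "patient"
        answer = ask(target, case_vignette, move)
        transcript += f"\nQ ({target}): {move}\nA: {answer}"
    return ask("clinical", transcript, "Give your FINAL DIAGNOSIS now.")
```

In use, a question's vignette would be passed as case_vignette and the returned diagnosis compared against the SANS answer key; the patient-only condition would correspond to removing the SYSTEM route from the loop.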
Results
GPT-4o showed significant performance drops from traditional vignettes to conversational formats in both multiple-choice (89.0% to 60.9%, p<0.0001) and free-response scenarios (78.4% to 30.3%, p<0.0001). Adding access to objective data through the System AI improved performance (to 67.4%, p=0.0015 and 61.8%, p<0.0001, respectively). Questions requiring image interpretation showed similar patterns but lower overall accuracy. Residents outperformed GPT-4o in free-response conversations (70.0% vs 28.3%, p=0.0030) while using fewer interactions, and they reported high educational value for the interactive format.
Conclusions
This multi-AI agent framework provides both a more challenging evaluation method for LLMs and an engaging educational tool for neurosurgical training. The significant performance drops in conversational formats suggest that traditional multiple-choice testing may overestimate LLMs’ clinical reasoning capabilities, while the framework’s interactive nature offers promising applications for enhancing medical education.