Performance of Artificial Intelligence and Large Language Models (LLMs) on Neurosurgical Board Examinations Across Text and Visual Modalities

Abstract

Background: Large language models (LLMs) are increasingly applied in clinical practice and medical education, yet their reliability across textual and visual modalities in highly specialized domains such as neurosurgery remains unclear. Prior evaluations have been limited to text-only questions or narrow subsets of models, leaving the role of multimodal reasoning poorly defined.

Methods: We benchmarked the latest LLMs, comprising 12 text-only and 10 multimodal systems, on 476 board-style neurosurgical questions spanning 13 subspecialties. Models were queried synchronously at temperature = 0 under standardized prompting with no explicit reasoning allowed, mirroring examination conditions. Accuracy was compared across text and visual modalities and stratified by subspecialty and imaging type. Robustness was assessed using latency, parsing failures, and ablation experiments withholding clinical vignette components.

Results: Text-only and multimodal models achieved nearly identical mean accuracies of 67.9% and 68.5%, respectively, indicating that visual inputs did not provide a consistent overall benefit. Performance differed markedly across individual models. Gemini 2.5 Pro, Grok, and GPT 5 exceeded 80% accuracy, approaching resident performance. GPT 4.0 and GPT 4.5 followed in the high 70s, Claude Sonnet 3.7 and Claude Opus 4.1 performed in the mid 70s, and MedGemma and Llama 4 clustered in the low 70s. DeepSeek R1V3 performed close to chance. On image-based questions, Gemini 2.5 Pro again led, while Grok, GPT 4.0, GPT 5, and Claude Opus clustered near 70% and the Llama 4 models dropped to approximately 50%. Subspecialty analysis showed that visual input improved performance in neuroradiology, tumor, pediatrics, and spine, whereas trauma, vascular, and pain questions became less accurate with images, producing a bimodal pattern of benefit. Ablation experiments showed that removing the clinical history produced the largest decline in accuracy (19.3% reduction), while withholding physical examination or laboratory data produced smaller effects (6.0% and 5.9%, respectively). Questions that no model answered correctly accounted for 4% of the dataset and clustered in neuroradiology, vascular anatomy, and rare pediatric conditions. Operational findings highlighted practical issues: the most accurate models were often slower to respond, latency ranged from 0.22 seconds to more than 27 seconds, and parsing failures were uncommon in GPT 5, GPT 4.5, and Llama 4 but exceeded 13% in Claude Opus.

Conclusions: Current LLMs can approach resident-level performance in structured neurosurgical domains and show selective benefits from visual input, but they remain unreliable in anatomy-heavy, high-stakes contexts such as vascular and trauma. Their dependence on clinical history and susceptibility to systematic visual errors highlight the need for improved vision–language alignment before unsupervised clinical use. Until then, their role is best suited to supervised educational support with explicit safeguards.
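The evaluation protocol summarized in the Methods (synchronous queries at temperature = 0, a standardized exam-style prompt, answer parsing, and latency measurement) could be implemented along the lines of the minimal sketch below. This is an illustrative assumption rather than the authors' published code: the prompt wording, the use of an OpenAI-compatible client, and the A–E answer format are placeholders, and non-OpenAI models would need their respective clients.

```python
import re
import time
from openai import OpenAI  # placeholder: any OpenAI-compatible endpoint

client = OpenAI()

# Hypothetical standardized prompt; the study's exact wording is not given in the abstract.
PROMPT_TEMPLATE = (
    "You are taking a neurosurgical board examination.\n"
    "Answer with the single letter of the best option and nothing else.\n\n"
    "{question}\n{options}"
)

def ask_model(model: str, question: str, options: str) -> dict:
    """Query one model on one board-style question deterministically and parse its answer."""
    start = time.time()
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic, exam-like conditions
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(question=question, options=options),
        }],
    )
    latency = time.time() - start
    raw = response.choices[0].message.content.strip()
    # A reply that does not begin with a bare option letter is counted as a parsing failure.
    match = re.match(r"^[A-E]\b", raw)
    return {
        "answer": match.group(0) if match else None,
        "parse_failure": match is None,
        "latency_s": latency,
    }
```

Accuracy per model would then follow from comparing the parsed letter against the answer key, with parsing failures and latencies logged alongside each response.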
