Comparison of Large Language Models’ Performance on Neurosurgical Board Examination Questions

Nicholas S. Andrade
Surya Donty

Abstract

Background

Multiple-choice board examinations are a primary objective measure of competency in medicine. Large language models (LLMs) have demonstrated rapid improvements in performance on medical board examinations in the past two years. We evaluated five leading LLMs on neurosurgical board exam questions.

Methods

We evaluated five LLMs (OpenAI o1, OpenEvidence, Claude 3.5 Sonnet, Gemini 2.0, and xAI Grok2) on 500 multiple-choice questions from the Self-Assessment in Neurological Surgery (SANS) American Board of Neurological Surgery (ABNS) Primary Board Examination Review. Performance was analyzed across 12 subspecialty categories and compared to established passing thresholds.

Results

All models exceeded the passing threshold. OpenAI o1 achieved the highest accuracy (87.6%), followed by OpenEvidence (84.2%), Claude 3.5 Sonnet (83.2%), Gemini 2.0 (81.0%), and xAI Grok2 (79.0%). Across all models, performance was strongest in the Other General (97.4%) and Peripheral Nerve (97.1%) categories and weakest in Neuroradiology (57.4%).
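As a rough illustration of what the reported percentages mean in raw counts, the sketch below converts each overall accuracy into an implied number of correct answers. It assumes every model was scored on all 500 questions, which the abstract does not state explicitly; the names `reported_accuracy`, `TOTAL_QUESTIONS`, and `implied_correct` are illustrative only and do not come from the article.

```python
# Overall accuracies (percent) as reported in the Results paragraph.
reported_accuracy = {
    "OpenAI o1": 87.6,
    "OpenEvidence": 84.2,
    "Claude 3.5 Sonnet": 83.2,
    "Gemini 2.0": 81.0,
    "xAI Grok2": 79.0,
}

# Assumption: each model was scored on the full 500-question set.
TOTAL_QUESTIONS = 500


def implied_correct(accuracy_pct, total=TOTAL_QUESTIONS):
    """Return the number of correct answers implied by a percentage accuracy."""
    return round(accuracy_pct * total / 100)


# Implied correct-answer counts per model under the assumption above.
implied_counts = {
    model: implied_correct(acc) for model, acc in reported_accuracy.items()
}

# Unweighted mean of the five reported accuracies.
mean_accuracy = sum(reported_accuracy.values()) / len(reported_accuracy)

for model, count in implied_counts.items():
    print(f"{model}: {count}/{TOTAL_QUESTIONS} correct")
print(f"Unweighted mean accuracy: {mean_accuracy:.1f}%")
```

Under that assumption, the 87.6% reported for OpenAI o1 corresponds to 438 of 500 questions, and the gap between the highest- and lowest-scoring models is 43 questions.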

Conclusions

State-of-the-art LLMs continue to improve, and all models demonstrated strong performance on neurosurgical board examination questions. Medical image analysis continues to be a limitation of current LLMs. The current level of LLM performance challenges the relevance of written board examinations in trainee evaluation and suggests that LLMs are ready for implementation in clinical medicine and medical education.

Version published to 10.1101/2025.02.20.25322623v1 on medRxiv
Feb 24, 2025

Related articles

Comparative Evaluation of Large Language Models for Medical Education: Performance Analysis in Urinary System Histology

This article has 2 authors:
1. Anikó Szabó
2. Ghasem Dolatkhah Laein
This article has no evaluations
Latest version Mar 13, 2025

Evolution of AI in Anatomy Education: Comparing Current Large Language Models Against Historical ChatGPT Performance on USMLE-Style Questions

This article has 2 authors:
1. Olena Bolgova
2. Volodymyr Mavrych
This article has no evaluations
Latest version Mar 24, 2025

Do Language Models Think Like Doctors?

This article has 15 authors:
1. Liam G. McCoy
2. Rajiv Swamy
3. Nidhish Sagar
4. Minjia Wang
5. James Cao
6. Stephen Bacchi
7. Nigel Fong
8. Nigel CK Tan
9. Kevin Tan
10. Thomas A. Buckley
11. Peter Brodeur
12. Leo Anthony Celi
13. Arjun Manrai
14. Aloysius Humbert
15. Adam Rodman
This article has no evaluations
Latest version Feb 12, 2025
