Evaluating the diagnostic accuracy of vision language models for neuroradiological image interpretation
Abstract
This study evaluates the diagnostic performance of commercial and open-source Vision-Language Models (VLMs) in neuroradiological image interpretation, using a dataset of 100 brain and spine cases from Radiopaedia. Five VLMs (Gemini 2.0, OpenAI o1, Llama 3.2 90B, Qwen 2.5, Grok-2-Vision) were compared with expert neuroradiologists in generating differential diagnoses from brief clinical presentations and imaging. Neuroradiologists achieved a mean accuracy of 86.2%, whereas the best-performing VLM (Gemini 2.0) reached 35%. Scoring the top three differentials improved VLM accuracy only marginally, and performance remained inferior to that of human experts. Clinical harm analysis revealed frequent diagnostic risks, primarily treatment delays, with potentially harmful outputs in up to 45% of cases. Error analysis showed consistent failure modes, including incorrect anatomical localization, inaccurate imaging descriptions, and hallucinated findings. These results highlight the current limitations of VLMs and underscore the importance of expert oversight in neuroradiological diagnosis.