More Harm than Help? Evaluating the Capabilities of Vision-Language Models in Neurological Image Analysis

Abstract

Objectives: This study evaluates the performance of both open-source and commercial Vision Language Models (VLMs) in interpreting radiological images of neurological diseases, comparing their diagnostic accuracy to that of experienced neuroradiologists.

Methods: A dataset of 100 brain and spine pathology cases with confirmed diagnoses was curated from the Radiopaedia database to reflect routine clinical neuroradiology practice. Five neuroradiologists reviewed the cases, including imaging and case presentations, to determine the most probable diagnosis. In parallel, five VLMs (Gemini 2.0, GPT-4o1-Preview, Llama 3.2 90B, Qwen 2.5, and Grok-2-vision) received the same cases and were tasked with generating three differential diagnoses along with their reasoning. Two neuroradiologists then evaluated the accuracy of both the single most probable diagnosis and the top three diagnoses produced by each VLM, assessed the rationale provided, and rated the potential for harmful outcomes based on the VLM outputs.

Results: Neuroradiologists achieved a mean diagnostic accuracy of 86.2%, significantly outperforming all VLMs. Among the models, Gemini 2.0 achieved the highest accuracy at 35%, with 28% of its diagnoses deemed potentially harmful, while Grok-2-vision had the lowest accuracy at 9%, with 45% of its outputs categorized as harmful. All models showed a trend toward slightly lower accuracy as the number of images per case increased, although the strength of this relationship was modest. Evaluation of potential harm revealed that treatment delay was the most common risk, ranging from 28% for Gemini 2.0 to 45% for Grok-2-vision. Error analysis indicated that the most frequent causes of misdiagnosis were incorrect anatomic classification, with error rates ranging from 26% for Gemini 2.0 to 53% for Grok-2-vision, and inaccurate description of imaging findings, ranging from 35% for Gemini 2.0 to 72% for Grok-2-vision.

Conclusion: While VLMs hold promise for enhancing radiological workflows, the current state of the art among open-source and commercial models is far from reliable for the interpretation of radiological images of neurological diseases.
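As an illustration of the querying step described in the Methods, the sketch below sends a case's images and clinical presentation to a vision-language model and asks for three ranked differential diagnoses with supporting reasoning. This is a minimal sketch, not the authors' pipeline: the prompt wording, the "gpt-4o" model name, the example file paths, and the use of the OpenAI chat completions API are assumptions for illustration, since the abstract does not specify how the five models were accessed or prompted.

    # Minimal sketch of querying a VLM with a case's images and presentation,
    # asking for three ranked differential diagnoses with reasoning.
    # Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the
    # environment; model name, prompt text, and paths are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def encode_image(path: str) -> str:
        """Return a base64 data URL for a local PNG image file."""
        with open(path, "rb") as f:
            return "data:image/png;base64," + base64.b64encode(f.read()).decode()

    def query_vlm(image_paths: list[str], presentation: str,
                  model: str = "gpt-4o") -> str:
        """Ask a vision-language model for three ranked differential diagnoses."""
        content = [{
            "type": "text",
            "text": (
                "You are given neuroradiology images and a case presentation.\n"
                f"Presentation: {presentation}\n"
                "List the three most probable differential diagnoses, ranked, "
                "and explain the imaging findings supporting each."
            ),
        }]
        # Attach every image for the case; accuracy may vary with image count.
        content += [{"type": "image_url", "image_url": {"url": encode_image(p)}}
                    for p in image_paths]
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": content}],
        )
        return response.choices[0].message.content

    # Example with hypothetical inputs:
    # print(query_vlm(["case001_axial_t2.png"],
    #                 "54-year-old with progressive gait ataxia"))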
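The reported relationship between per-case image count and accuracy could be quantified in several ways; one simple option, shown below with placeholder data, is a point-biserial correlation between binary correctness and the number of images. The abstract does not state which statistic the authors used, so this is only a sketch of the kind of analysis involved.

    # Sketch of quantifying the trend between image count and diagnostic
    # correctness via point-biserial correlation (scipy.stats.pointbiserialr).
    # The arrays below are placeholders, not study data.
    import numpy as np
    from scipy import stats

    n_images = np.array([1, 2, 3, 4, 5, 6, 2, 3])  # images per case (placeholder)
    correct = np.array([1, 1, 0, 0, 0, 0, 1, 0])   # 1 = correct top diagnosis

    r, p = stats.pointbiserialr(correct, n_images)
    print(f"point-biserial r = {r:.2f}, p = {p:.3f}")

A negative r of small magnitude would correspond to the "slightly lower accuracy with more images, but modest relationship" pattern described in the Results.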