More Harm than Help? Evaluating the Capabilities of Vision-Language Models in Neurological Image Analysis

Abstract

Objectives: This study evaluates the performance of both open-source and commercial Vision Language Models (VLMs) in interpreting radiological images of neurological diseases, comparing their diagnostic accuracy to that of experienced neuroradiologists.

Methods: A dataset of 100 brain and spine pathology cases with confirmed diagnoses was curated from the Radiopaedia database to reflect routine clinical neuroradiology practice. Five neuroradiologists reviewed the cases, including imaging and case presentations, to determine the most probable diagnosis. In parallel, five VLMs (Gemini 2.0, GPT-4o1-Preview, Llama 3.2 90B, Qwen 2.5, and Grok-2-vision) received the same cases and were tasked with generating three differential diagnoses along with their reasoning. Two neuroradiologists then evaluated the accuracy of both the single most probable diagnosis and the top three diagnoses produced by each VLM, assessed the rationale provided, and rated the potential for harmful outcomes based on the VLM outputs.

Results: Neuroradiologists achieved a mean diagnostic accuracy of 86.2%, significantly outperforming all VLMs. Among the models, Gemini 2.0 achieved the highest accuracy at 35%, with 28% of its diagnoses deemed potentially harmful, while Grok-2-vision had the lowest accuracy at 9%, with 45% of its outputs categorized as harmful. All models showed a trend toward slightly lower accuracy as the number of images per case increased, although the strength of this relationship was modest. Evaluation of potential harm revealed that treatment delay was the most common risk, ranging from 28% for Gemini 2.0 to 45% for Grok-2-vision. Error analysis indicated that the most frequent causes of misdiagnosis were incorrect anatomic classification, with error rates ranging from 26% for Gemini 2.0 to 53% for Grok-2-vision, and inaccurate description of imaging findings, ranging from 35% for Gemini 2.0 to 72% for Grok-2-vision.

Conclusion: While VLMs hold promise for enhancing radiological workflows, the current state of the art among open-source and commercial models is far from reliable for the interpretation of radiological images of neurological diseases.
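As an illustration of the querying step described in the Methods, the sketch below sends a case's images and clinical presentation to a vision-language model and asks for three ranked differential diagnoses with supporting reasoning. This is a minimal sketch, not the authors' pipeline: the prompt wording, the "gpt-4o" model name, the example file paths, and the use of the OpenAI chat completions API are assumptions for illustration, since the abstract does not specify how the five models were accessed or prompted.

    # Minimal sketch of querying a VLM with a case's images and presentation,
    # asking for three ranked differential diagnoses with reasoning.
    # Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the
    # environment; model name, prompt text, and paths are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def encode_image(path: str) -> str:
        """Return a base64 data URL for a local PNG image file."""
        with open(path, "rb") as f:
            return "data:image/png;base64," + base64.b64encode(f.read()).decode()

    def query_vlm(image_paths: list[str], presentation: str,
                  model: str = "gpt-4o") -> str:
        """Ask a vision-language model for three ranked differential diagnoses."""
        content = [{
            "type": "text",
            "text": (
                "You are given neuroradiology images and a case presentation.\n"
                f"Presentation: {presentation}\n"
                "List the three most probable differential diagnoses, ranked, "
                "and explain the imaging findings supporting each."
            ),
        }]
        # Attach every image for the case; accuracy may vary with image count.
        content += [{"type": "image_url", "image_url": {"url": encode_image(p)}}
                    for p in image_paths]
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": content}],
        )
        return response.choices[0].message.content

    # Example with hypothetical inputs:
    # print(query_vlm(["case001_axial_t2.png"],
    #                 "54-year-old with progressive gait ataxia"))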
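The reported relationship between per-case image count and accuracy could be quantified in several ways; one simple option, shown below with placeholder data, is a point-biserial correlation between binary correctness and the number of images. The abstract does not state which statistic the authors used, so this is only a sketch of the kind of analysis involved.

    # Sketch of quantifying the trend between image count and diagnostic
    # correctness via point-biserial correlation (scipy.stats.pointbiserialr).
    # The arrays below are placeholders, not study data.
    import numpy as np
    from scipy import stats

    n_images = np.array([1, 2, 3, 4, 5, 6, 2, 3])  # images per case (placeholder)
    correct = np.array([1, 1, 0, 0, 0, 0, 1, 0])   # 1 = correct top diagnosis

    r, p = stats.pointbiserialr(correct, n_images)
    print(f"point-biserial r = {r:.2f}, p = {p:.3f}")

A negative r of small magnitude would correspond to the "slightly lower accuracy with more images, but modest relationship" pattern described in the Results.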