A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Hao Guan
Peter C. Hou
Pengyu Hong
Liqin Wang
Wenyu Zhang
Xinsong Du
Zhengyang Zhou
Li Zhou

Abstract

Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.
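
The abstract does not give the CREST formula, so the following is only a minimal sketch of how a risk-weighted error score of this general kind might be computed. The weights (1, 2, 3), the per-report normalization, the error-type names, and the function names are all hypothetical illustrations and are not taken from the paper or the linked repository.

```python
from collections import Counter

# Hypothetical risk weights; the actual CREST weights are not stated in the abstract.
RISK_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}


def risk_weighted_error_score(errors, num_reports, weights=RISK_WEIGHTS):
    """Return the average risk-weighted error count per report.

    errors: iterable of (error_type, risk_level) pairs pooled over all reports.
    num_reports: number of generated reports that were reviewed.
    """
    if num_reports <= 0:
        raise ValueError("num_reports must be positive")
    total = sum(weights[risk] for _, risk in errors)
    return total / num_reports


def errors_by_risk(errors):
    """Count annotated errors per risk level."""
    return Counter(risk for _, risk in errors)


# Hypothetical annotations for two generated reports (error names are made up).
example_errors = [
    ("false_finding", "high"),
    ("wrong_location", "medium"),
    ("omitted_comparison", "low"),
    ("false_finding", "high"),
]
score = risk_weighted_error_score(example_errors, num_reports=2)
```

Under these assumed weights, a high-risk error contributes three times as much to the score as a low-risk one, so two models with the same raw error count can still receive different scores.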

Version published to 10.1101/2025.07.13.25331222 on medRxiv
Jul 14, 2025

Related articles

Benchmark Evaluation of Multi-Modal Large Language Models for Ophthalmic Diagnosis

This article has 10 authors:
1. Weihua Yang
2. Shoujun Huang
3. Junhong Chen
4. Jiaoman Wang
5. Ping Zhang
6. Wending Du
7. Yuan Hong
8. Dexing Kong
9. Wei Lou
10. Wei Chi
This article has no evaluations. Latest version: Jul 23, 2025

Enhancing Clinical Reasoning in Medical Vision-Language Model through Structured Prompts

This article has 3 authors:
1. Kavya Dasaramoole Prakash
2. Kiseong Kim
3. Youngmahn Han
This article has no evaluations. Latest version: Aug 1, 2025

CLEVER: Clinical Large Language Model Evaluation by Expert Review

This article has 4 authors:
1. Veysel Kocaman
2. Mustafa Kaya
3. Andrei Ferrer
4. David Talby
This article has no evaluations. Latest version: Jul 23, 2025
