Performance of Vision-Language Models for Zero-Shot Lung Nodule Detection on Chest Radiographs

Mizuho Nishio
Hidetoshi Matsuo
Takaaki Matsunaga
Koji Fujimoto
Nicolas Deperrois
Farhad Nooralahzadeh
Thomas Frauenfelder
Michael Krauthammer
Takamichi Murakami

Abstract

Background and Objectives

The ability of vision-language models (VLMs) to detect lung nodules on chest radiographs remains uncertain. This retrospective study compared the zero-shot performance of six VLMs for lung nodule detection using the Japanese Society of Radiological Technology (JSRT) chest radiograph database.

Methods

A total of 247 chest radiographs from the JSRT database (154 with nodules and 93 without) were preprocessed and evaluated using six VLMs: RadVLM, gpt-4o-mini, Qwen3-VL-8B-Instruct, MedGemma-4b-it, LLaVA-Rad, and CheXpert Plus Model. Each model was tested in a zero-shot setting. The text outputs were binarized into nodule-present or nodule-absent labels by consensus between two radiologists. Sensitivity, specificity, accuracy, precision, and F1 score were calculated. Pairwise differences in sensitivity, specificity, and accuracy were assessed using McNemar’s test with Holm correction.
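As an illustration of the pairwise comparison described above, the following is a minimal sketch of McNemar's test followed by Holm correction. It assumes the exact (binomial) form of McNemar's test; the abstract does not state which variant was used. The function names `mcnemar_exact` and `holm_adjust` and the discordant counts in the usage lines are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases that model A classified correctly and model B incorrectly.
    c: cases that model B classified correctly and model A incorrectly.
    Assumption: exact binomial variant (the source does not specify one).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of k or fewer discordant pairs in one direction under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical discordant counts for three model pairs (not study data).
    pairs = {"A vs B": (2, 30), "A vs C": (9, 12), "B vs C": (25, 6)}
    raw = [mcnemar_exact(b, c) for b, c in pairs.values()]
    for name, p_raw, p_adj in zip(pairs, raw, holm_adjust(raw)):
        print(f"{name}: raw p = {p_raw:.4g}, Holm-adjusted p = {p_adj:.4g}")
```

Only the discordant pairs enter the test, which is why it suits comparisons of models evaluated on the same 247 radiographs.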

Results

Overall performance was limited across all models. RadVLM achieved the highest accuracy (44.5%, 110/247) with perfect specificity (100.0%, 93/93) and precision (100.0%); however, its sensitivity was low (11.0%, 17/154). LLaVA-Rad showed the highest sensitivity (27.3%, 42/154) and F1 score (37.7%), but lower specificity (71.0%, 66/93). MedGemma-4b-it also achieved 100.0% specificity, but with a sensitivity of only 5.2% (8/154). Grade-specific analysis showed that detection rates were highest for obvious nodules and remained limited for subtle nodules. Pairwise analyses revealed significant differences in sensitivity and specificity for selected model pairs, particularly between RadVLM and LLaVA-Rad.
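The reported percentages can be checked against the fractions given above. The sketch below derives confusion-matrix counts from those fractions (for example, 17/154 detected nodules implies 137 missed nodules) and recomputes the metrics with their standard definitions. The helper `binary_metrics` and the `COUNTS` table are illustrative names, not part of the study's code.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    predicted_positive = tp + fp
    f1_denominator = 2 * tp + fp + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        # Precision and F1 are undefined when their denominators are zero.
        "precision": tp / predicted_positive if predicted_positive else float("nan"),
        "f1": 2 * tp / f1_denominator if f1_denominator else float("nan"),
    }


# Counts derived from the fractions in the abstract
# (154 radiographs with nodules, 93 without).
COUNTS = {
    "RadVLM": dict(tp=17, fn=137, tn=93, fp=0),
    "LLaVA-Rad": dict(tp=42, fn=112, tn=66, fp=27),
    "MedGemma-4b-it": dict(tp=8, fn=146, tn=93, fp=0),
}

if __name__ == "__main__":
    for model, counts in COUNTS.items():
        metrics = binary_metrics(**counts)
        summary = ", ".join(f"{name} {100 * value:.1f}%" for name, value in metrics.items())
        print(f"{model}: {summary}")
```

With these counts, RadVLM's accuracy works out to 110/247 and LLaVA-Rad's F1 score to 84/223, which agree with the reported 44.5% and 37.7%.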

Conclusion

Current VLMs show limited zero-shot generalizability for lung nodule detection in the JSRT database, with marked trade-offs between sensitivity and specificity. Their near-term value may lie more in radiologist-assisted workflows than in stand-alone detection.

Clinical Impact

Current VLMs should not be used as stand-alone tools for lung nodule detection on chest radiographs because of their limited sensitivity and substantial model-dependent trade-offs. However, the high specificity of some models and the higher sensitivity of others suggest potential roles in radiologist-assisted workflows, such as report drafting and second-reader support.

Version published to 10.64898/2026.05.31.26354565 on medRxiv
Jun 3, 2026

Related articles

Board-Level Performance of Leading Open-Weight Vision-Language Models on the Japanese Diagnostic Radiology Board Examination: Reasoning, Image-Input, and Language Effects

This article has 14 authors:
1. Yuki Sonoda
2. Yosuke Yamagishi
3. Yuichiro Hirano
4. Soichiro Miki
5. Takahiro Nakao
6. Shouhei Hanaoka
7. Yukihiro Nomura
8. Akiyoshi Hamada
9. Noriko Kanemaru
10. Rintaro Miyo
11. Masumi Mizuki Takahashi
12. Reina Hosoi
13. Takeharu Yoshikawa
14. Osamu Abe
This article has no evaluations
Latest version Jul 13, 2026

Detection without calibration: benchmarking domestic and international large language models for quality control of Mandarin ¹⁸F-FDG PET/CT reports

This article has 5 authors:
1. Jingbo Wang
2. Weiqing Tang
3. Xingdi Ma
4. Huimin Yan
5. Ying Yuan
This article has no evaluations
Latest version Jun 26, 2026

Diagnostic accuracy of a DenseNet-121 deep learning algorithm for chest radiograph triage in health assessment applicants: a prospective shadow-mode validation study in Nepal

This article has 3 authors:
1. Lochan Shrestha
2. Dinesh Maharjan
3. Uttam Bista
This article has no evaluations
Latest version Jul 1, 2026
