Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Yuichiro Hirano
Soichiro Miki
Yosuke Yamagishi
Shouhei Hanaoka
Takahiro Nakao
Tomohiro Kikuchi
Yuta Nakamura
Yukihiro Nomura
Takeharu Yoshikawa
Osamu Abe

Abstract

Purpose

To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).

Materials and methods

The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without image input (text-only). Differences in accuracy between the two conditions were assessed using McNemar’s exact test. Two diagnostic radiologists (with 2 and 18 years of experience), blinded to model identity, independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) on a five-point Likert scale.

Legitimacy scores were analyzed using Friedman’s test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.
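As an illustration of two of the procedures named above, the following is a minimal Python sketch of McNemar's exact test (two-sided) and the Holm correction, using only the standard library. It is not the authors' analysis code, which the abstract does not describe; the function names `mcnemar_exact` and `holm_adjust` and all numbers in the usage section are hypothetical. Friedman's test and the Wilcoxon signed-rank test are not reimplemented here; in practice they could be taken from a statistics package such as SciPy (`scipy.stats.friedmanchisquare`, `scipy.stats.wilcoxon`).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: questions answered correctly only in condition A (e.g. vision)
    c: questions answered correctly only in condition B (e.g. text-only)
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as extreme as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical discordant counts for one model, NOT taken from the study.
    print(mcnemar_exact(b=18, c=6))
    # Hypothetical raw p-values from pairwise comparisons, NOT taken from the study.
    print(holm_adjust([0.001, 0.020, 0.030, 0.400]))
```

McNemar's test uses only the questions on which the two conditions disagree, which suits this design because every model answers the same question set under both conditions.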

Results

The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 again ranked highest, with an accuracy of 67%. The addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. From both raters, o3 and Gemini 2.5 Pro each received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet.

Conclusion

Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology.

Secondary abstract

Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI’s o3 and Google DeepMind’s Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.

Version published to 10.1101/2025.06.23.25329534 on medRxiv
Jun 23, 2025