Benchmarking GPT-5 Performance and Repeatability on the Japanese National Examination for Radiological Technologists over the Past Decade (2016–2025)

Abstract

Purpose

To evaluate GPT-5 against GPT-4o on the Japanese national examination for radiological technologists (2016–2025), assessing accuracy, repeatability, and factors influencing performance differences.

Materials and methods

We analyzed 1,992 multiple-choice questions spanning medical and engineering domains, including text- and image-based questions. Both models answered all questions in Japanese under identical conditions across three independent runs. Majority-vote accuracy (correct if ≥ 2 of 3 runs were correct) and first-attempt accuracy were compared using McNemar’s test. Repeatability was quantified with Fleiss’ κ. Univariable and multivariable analyses were conducted to identify question-level factors associated with GPT-5 improvements.
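A minimal sketch of this scoring and statistical workflow is shown below. It is not the authors' code: the array shapes, variable names, and random placeholder data are assumptions, standing in for per-question grading results recorded as a 1,992 × 3 binary array per model.

```python
# Minimal sketch (not the authors' code) of the workflow described above:
# majority-vote accuracy over three runs, a paired McNemar's test between
# models, and Fleiss' kappa for repeatability.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.inter_rater import fleiss_kappa

# Assumed inputs: rows = questions, columns = independent runs;
# 1 = correct, 0 = incorrect. Random arrays are placeholders only.
gpt5_runs = np.random.default_rng(0).integers(0, 2, size=(1992, 3))
gpt4o_runs = np.random.default_rng(1).integers(0, 2, size=(1992, 3))

# Majority-vote accuracy: correct if >= 2 of 3 runs were correct.
gpt5_mv = gpt5_runs.sum(axis=1) >= 2
gpt4o_mv = gpt4o_runs.sum(axis=1) >= 2
print(f"majority-vote accuracy: GPT-5 {gpt5_mv.mean():.3f}, "
      f"GPT-4o {gpt4o_mv.mean():.3f}")

# Paired comparison via McNemar's test on the 2x2 agreement table.
table = [[np.sum(gpt5_mv & gpt4o_mv), np.sum(gpt5_mv & ~gpt4o_mv)],
         [np.sum(~gpt5_mv & gpt4o_mv), np.sum(~gpt5_mv & ~gpt4o_mv)]]
print(f"McNemar P = {mcnemar(table, exact=False, correction=True).pvalue:.4g}")

# Repeatability: Fleiss' kappa across the three runs, treating each run
# as a rater assigning one of two categories (correct / incorrect).
for name, runs in [("GPT-5", gpt5_runs), ("GPT-4o", gpt4o_runs)]:
    n_correct = runs.sum(axis=1)
    counts = np.column_stack([n_correct, 3 - n_correct])  # (questions, categories)
    print(f"{name} Fleiss' kappa: {fleiss_kappa(counts):.3f}")
```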

Results

GPT-5 consistently outperformed GPT-4o across all 10 exam years, achieving a majority-vote accuracy of 92.8% (95% CI: 91.5–93.8) versus 72.4% (95% CI: 70.4–74.4) (P < .001). Repeatability was higher for GPT-5 (κ = 0.925) than for GPT-4o (κ = 0.904), with all three runs correct on 88.2% vs. 68.9% of items. GPT-5 achieved marked gains on text-based questions (96.5% vs. 78.1%) and substantial improvements on image-based questions (72.6% vs. 41.9%). Among medical image modalities, significant improvements were observed for MRI, CT, and radiography, whereas gains were smaller for ultrasound and nuclear medicine, highlighting persistent challenges in clinically oriented image interpretation. The greatest advantages overall were in calculation questions (97.3% vs. 39.3%) and engineering-related domains, consistent with external benchmarks that emphasize GPT-5's strengthened reasoning.
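The abstract does not state how the 95% CIs were computed; a plausible reconstruction (an assumption, not confirmed by the source) is a binomial proportion interval such as the Wilson score interval, sketched below with counts back-calculated from the reported percentages.

```python
# Hedged sketch: recompute 95% CIs for the majority-vote accuracies,
# assuming Wilson score intervals for a binomial proportion. The CI
# method is an assumption, and the counts are approximate back-
# calculations from the reported percentages.
from statsmodels.stats.proportion import proportion_confint

n = 1992  # total questions
for label, acc in [("GPT-5", 0.928), ("GPT-4o", 0.724)]:
    correct = round(acc * n)  # approximate count of majority-vote-correct items
    lo, hi = proportion_confint(correct, n, alpha=0.05, method="wilson")
    print(f"{label}: {correct / n:.1%} (95% CI: {lo:.1%}-{hi:.1%})")
```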

Conclusion

GPT-5 demonstrated significantly higher accuracy and repeatability than GPT-4o across a decade of examinations, with especially pronounced gains in quantitative reasoning, engineering content, and diagram interpretation. Although improvements extended to medical images, performance in clinical image interpretation remained limited.
