Limitations in Chest X‐Ray Interpretation by Vision‐Capable Large Language Models, Gemini 1.0, Gemini 1.5 Pro, GPT‐4 Turbo, and GPT‐4o
Abstract
Background/Objectives: A complete interpretation of a chest X-ray (CXR) requires accurate identification of lesion presence, diagnosis, location, size, and number. However, how well large language models with vision capabilities (vLLMs) perform these tasks remains uncertain. This study aimed to evaluate the image interpretation performance of vLLMs in the absence of clinical information. Methods: A total of 247 CXRs covering 13 diagnoses, including pulmonary edema, cardiomegaly, and lobar pneumonia, were evaluated using Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o. The text outputs generated by the vLLMs were assessed for diagnostic accuracy and identification of key imaging features. Each interpretation was classified as fully correct, partially correct, or incorrect according to the criteria for complete interpretation. Results: When both fully and partially correct responses were counted as successful detections, vLLMs effectively identified large, bilateral, or multiple lesions and large devices, such as acute pulmonary edema (53.8%), lobar pneumonia (55%), multiple malignancies (55%), massive pleural effusions (47.5%), and pacemakers (98.3%), with significant differences on the chi-square test. Feature descriptions varied among models, especially for posteroanterior and anteroposterior views and side markers, while central lines were partially recognized. Gemini 1.5 Pro (49.0%) performed best, followed by Gemini 1.0 (43.8%), GPT-4o (32.0%), and GPT-4 Turbo (20.0%). Conclusions: Although vLLMs were able to identify certain diagnoses and key imaging features, their limitations in detecting small lesions, recognizing laterality, reasoning through differential diagnoses, and using domain-specific expressions indicate that CXR interpretation without textual cues still requires further improvement.
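The scoring rule described in the Results (fully and partially correct responses both count as successful detections) and the chi-square comparison can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names `detection_rate` and `chi_square_statistic`, the grade strings, and all counts are hypothetical and are not data from the study. The sketch computes only the Pearson statistic and its degrees of freedom; a p-value would need a chi-square distribution function from a statistics library.

```python
from collections import Counter


def detection_rate(grades):
    """Share of responses graded fully or partially correct."""
    counts = Counter(grades)
    hits = counts["fully_correct"] + counts["partially_correct"]
    return hits / len(grades)


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up grades for one hypothetical model on ten images (illustration only).
example_grades = (
    ["fully_correct"] * 3 + ["partially_correct"] * 2 + ["incorrect"] * 5
)
example_rate = detection_rate(example_grades)

# Made-up 2 x 2 table: rows are two hypothetical models,
# columns are detected / not detected.
example_table = [[30, 20], [20, 30]]
example_stat, example_dof = chi_square_statistic(example_table)

print(f"detection rate: {example_rate:.1%}")
print(f"chi-square: {example_stat:.2f} (dof = {example_dof})")
```

Treating partially correct responses as detections is what makes the reported percentages an upper bound on fully correct interpretation; changing the `hits` line to count only `fully_correct` would give the stricter rate.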