Limitations in Chest X-Ray Interpretation by Vision-Capable Large Language Models: Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o

Abstract

Background/Objectives: Complete interpretation of chest X-rays (CXRs) requires accurate identification of lesion presence, diagnosis, location, size, and number. However, the effectiveness of vision-capable large language models (vLLMs) at these tasks remains uncertain. This study evaluated the image-only interpretation performance of vLLMs in the absence of clinical information.

Methods: A total of 247 CXRs covering 13 diagnostic categories, including pulmonary edema, cardiomegaly, and lobar pneumonia, were evaluated using Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o. The text outputs generated by the models were evaluated at two levels: (1) primary diagnosis accuracy across the 13 predefined diagnostic categories, and (2) identification of key imaging features described in the generated text. Primary diagnosis accuracy was assessed on whether the model identified the target diagnostic category and was classified as fully correct, partially correct, or incorrect according to predefined clinical criteria. Non-diagnostic imaging features, such as posteroanterior/anteroposterior (PA/AP) view, side markers, foreign bodies, and devices, were recorded and analyzed separately rather than being incorporated into the primary diagnostic scoring.

Results: When fully and partially correct responses were treated as successful detections, the vLLMs showed higher sensitivity for large, bilateral, or multiple lesions and for prominent devices: acute pulmonary edema, lobar pneumonia, multiple malignancies, massive pleural effusions, and pacemakers all demonstrated statistically significant differences across categories in chi-square analyses. Feature descriptions varied among models, especially for PA/AP views and side markers, although central lines were partially recognized. Across the entire dataset, Gemini 1.5 Pro achieved the highest overall detection rate, followed by Gemini 1.0, GPT-4o, and GPT-4 Turbo.

Conclusions: Although the vLLMs were able to identify certain diagnoses and key imaging features, their limitations in detecting small lesions, recognizing laterality, reasoning through differential diagnoses, and using domain-specific expressions indicate that CXR interpretation without textual cues still requires substantial improvement.
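As a rough illustration of the chi-square analysis mentioned in the Results, the Python sketch below tests whether detection outcomes (fully correct, partially correct, incorrect) depend on diagnostic category for a single model. The category names and all counts are hypothetical placeholders, not data from the study, and the exact contingency-table structure used by the authors is an assumption.

```python
# Minimal sketch of a per-category chi-square comparison, assuming outcomes
# are tallied per diagnostic category. All counts are HYPOTHETICAL, not
# data from the study.
from scipy.stats import chi2_contingency

# Rows: diagnostic categories; columns: counts of fully correct,
# partially correct, and incorrect responses for one model.
counts = {
    "acute pulmonary edema": [15, 3, 2],
    "lobar pneumonia":       [12, 5, 3],
    "small solitary nodule": [2, 4, 14],
}

# Chi-square test of independence between category and outcome.
table = list(counts.values())
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

Under this framing, a small p-value indicates that detection outcomes differ significantly across diagnostic categories, consistent with the reported advantage for large, bilateral, or multiple lesions over subtler findings.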
