Benchmarking Multimodal Large Language Models for Binary Classification of Pediatric Chest X-Rays: A Comparative Evaluation Using a Public Pneumonia Dataset

Abstract

Background/Objective: Pneumonia remains a leading cause of global mortality, and chest X-rays serve as a primary diagnostic tool. Although large language models (LLMs) show promise in medical imaging, evidence on their performance in pediatric pneumonia classification is limited. This study evaluates how well five prominent multimodal LLMs distinguish bacterial pneumonia from normal findings in pediatric chest X-rays, framed as a binary classification task.

Methods: We evaluated GPT-4o, GPT-4.1, Gemini 2.5 Pro Preview, Claude 4 Sonnet, and Grok 2 Vision on 1,000 pediatric chest X-ray images (500 normal, 500 bacterial pneumonia) from the Guangzhou Women and Children's Medical Center dataset. Each model received the same binary classification prompt through its respective API. Performance was assessed using accuracy, sensitivity, specificity, F1-score, and confusion matrix analysis.

Results: GPT-4.1 showed the most balanced performance, with 84% sensitivity and 76% specificity. GPT-4o achieved high sensitivity (99%) but poor specificity (18%), and Gemini 2.5 Pro showed a similar pattern (97% sensitivity, 25% specificity). Claude 4 Sonnet classified every image as pneumonia (100% sensitivity, 0% specificity). Grok 2 Vision showed moderate performance, with 82% sensitivity and 56% specificity.

Conclusion: Performance varies substantially among LLMs for pediatric pneumonia detection. GPT-4.1 offered the best balance of sensitivity and specificity among the models tested, while the other models showed a concerning tendency toward false positives. These findings underscore the need for rigorous benchmarking before LLMs are implemented clinically in pediatric radiology.
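The metrics named in the abstract all follow from the four cells of a binary confusion matrix. The sketch below is illustrative only and is not the authors' evaluation code: the `ConfusionMatrix` class and `binary_metrics` function are hypothetical names, and pneumonia is assumed to be the positive class. The first example uses counts implied directly by the abstract (a model that labels all 500 normal and all 500 pneumonia images as pneumonia). The second reconstructs approximate counts from the rounded GPT-4.1 percentages, so its outputs are estimates rather than reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Hypothetical container; pneumonia is assumed to be the positive class."""
    tp: int  # pneumonia images labelled pneumonia
    fn: int  # pneumonia images labelled normal
    fp: int  # normal images labelled pneumonia
    tn: int  # normal images labelled normal


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(cm: ConfusionMatrix) -> dict:
    """Compute accuracy, sensitivity, specificity, and F1 from the four cells."""
    total = cm.tp + cm.fn + cm.fp + cm.tn
    return {
        "accuracy": _ratio(cm.tp + cm.tn, total),
        "sensitivity": _ratio(cm.tp, cm.tp + cm.fn),
        "specificity": _ratio(cm.tn, cm.tn + cm.fp),
        "f1": _ratio(2 * cm.tp, 2 * cm.tp + cm.fp + cm.fn),
    }


# Every image labelled pneumonia: 500 true positives, 500 false positives.
all_positive = binary_metrics(ConfusionMatrix(tp=500, fn=0, fp=500, tn=0))

# ASSUMPTION: counts reconstructed from the rounded 84% / 76% figures,
# not taken from a reported confusion matrix.
approx_gpt41 = binary_metrics(ConfusionMatrix(tp=420, fn=80, fp=120, tn=380))

for name, metrics in [("all-positive", all_positive), ("approx GPT-4.1", approx_gpt41)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The all-positive case shows why sensitivity alone is not informative on a balanced test set: sensitivity is 1.0, but accuracy is 0.5 and specificity is 0.0, while F1 still comes out at about 0.67.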
