Atlas-Assisted Bone Age Estimation from Hand–Wrist Radiographs Using Multimodal Large Language Models: A Comparative Study

Abstract

Background/Objectives: Bone age assessment is critical in pediatric endocrinology and forensic medicine. Although recently developed multimodal large language models (LLMs) show potential in medical imaging, their diagnostic performance in bone age determination has not been sufficiently evaluated. This study evaluated the performance of four multimodal LLMs (ChatGPT-5, Gemini 2.5 Pro, Grok-3, and Claude 4 Sonnet) in bone age determination using the Gilsanz-Ratib (GR) atlas. Methods: This retrospective study included 245 pediatric patients (109 male, 136 female) under age 18 who underwent left hand-wrist radiography. Each model estimated bone age from the patient's radiograph with the GR atlas provided as a reference (atlas-assisted prompting). Bone age assessments made by an experienced radiologist using the GR atlas served as the reference standard. Performance was assessed using mean absolute error (MAE), intraclass correlation coefficient (ICC), and Bland-Altman analysis. Results: ChatGPT-5 demonstrated statistically superior performance, with an MAE of 1.46 years and an ICC of 0.849, showing the closest agreement with the reference standard. Gemini 2.5 Pro showed moderate performance with an MAE of 2.24 years; Grok-3 (MAE: 3.14 years) and Claude 4 Sonnet (MAE: 4.29 years) produced errors too large for clinical use. Conclusions: Significant performance differences exist among multimodal LLMs despite atlas-supported prompting. Only ChatGPT-5 qualified as "clinically useful," demonstrating potential as an auxiliary tool or educational support under expert supervision. The reliability of the other models remains insufficient.
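
For readers who want to reproduce the agreement statistics reported above, the following is a minimal sketch of how MAE, ICC(2,1) (two-way random effects, absolute agreement, single rater), and Bland-Altman bias with 95% limits of agreement can be computed for one model against the radiologist reference. The function names and the example bone ages are illustrative assumptions, not the study's code or data.

```python
import numpy as np


def mean_absolute_error(reference, estimates):
    """MAE in years between model estimates and the radiologist reference."""
    reference = np.asarray(reference, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    return np.mean(np.abs(estimates - reference))


def bland_altman(reference, estimates):
    """Return bias and 95% limits of agreement (bias +/- 1.96 SD of the differences)."""
    diffs = np.asarray(estimates, dtype=float) - np.asarray(reference, dtype=float)
    bias = diffs.mean()
    sd = diffs.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def icc_2_1(reference, estimates):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement."""
    y = np.column_stack([np.asarray(reference, float), np.asarray(estimates, float)])
    n, k = y.shape
    grand = y.mean()
    msr = k * np.sum((y.mean(axis=1) - grand) ** 2) / (n - 1)       # between-subject mean square
    msc = n * np.sum((y.mean(axis=0) - grand) ** 2) / (k - 1)       # between-rater mean square
    sse = np.sum((y - grand) ** 2) - (n - 1) * msr - (k - 1) * msc  # residual sum of squares
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    # Hypothetical bone ages in years (not the study's data).
    radiologist = [10.0, 7.5, 14.0, 3.0, 16.5, 9.0]
    model = [11.0, 8.0, 12.5, 4.5, 15.0, 10.5]
    print("MAE:", mean_absolute_error(radiologist, model))
    print("Bland-Altman (bias, lower LoA, upper LoA):", bland_altman(radiologist, model))
    print("ICC(2,1):", icc_2_1(radiologist, model))
```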