Systematic Comparison of Multimodal Large Language Models for Pediatric Profile Orthodontic Assessment and Early Intervention: ChatGPT, DeepSeek, and Gemini

Abstract

Objective: This study aimed to compare the performance of three multimodal large language models (ChatGPT, DeepSeek, and Gemini) in analyzing pediatric profile photographs and providing early orthodontic intervention recommendations, thereby assessing their clinical feasibility and reliability.

Materials and Methods: In this cross-sectional study, 100 children aged 5–12 years who attended Jiaxing Second Hospital between January and June 2025 were enrolled. Standardized profile photographs were obtained and processed uniformly before being analyzed by the three models using identical prompts. Model outputs were anonymized and independently evaluated under single-blind, randomized conditions by orthodontic experts and parents. An eight-dimension weighted scoring system was applied, encompassing professionalism, accuracy, completeness, individualization, safety, comprehensibility, empathy, and readability. Statistical analyses included the Friedman test, Wilcoxon signed-rank test, and Kendall’s W effect size.

Results: All three models achieved high overall scores, ranging from 3.9 to 4.2. ChatGPT consistently produced slightly higher mean scores (4.07–4.15), while DeepSeek and Gemini showed comparable performance (3.91–4.09). Inter-model differences were not statistically significant (all q > 0.05), and effect sizes were uniformly negligible (Kendall’s W = 0.003–0.029).

Conclusions: ChatGPT, DeepSeek, and Gemini demonstrated comparable and overall reliable performance in pediatric orthodontic screening based on profile photographs, with ChatGPT showing a slight but nonsignificant advantage. At present, LLMs may serve as supportive tools for early orthodontic assessment but cannot substitute for clinical expertise.

Clinical Relevance: This study highlights the potential role of large language models in pediatric orthodontic screening, suggesting they may improve accessibility and efficiency. Future research should incorporate multimodal inputs such as lateral cephalograms, CBCT, and intraoral scans, and conduct multicenter, large-scale validation to enhance clinical translation.
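
As an illustration of how the Friedman statistic and Kendall's W relate when several models are scored on the same cases, the following is a minimal Python sketch. It is not the authors' analysis code: the function names (`average_ranks`, `friedman_kendall_w`) and the example scores are hypothetical, and it assumes one row of scores per rated case with one column per model. It uses the standard tie-corrected Friedman statistic and the identity W = chi2 / (n(k - 1)); the p-value and the post hoc Wilcoxon signed-rank comparisons are left out.

```python
from collections import Counter


def average_ranks(row):
    """Rank the values in one row (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_kendall_w(scores):
    """Return (tie-corrected Friedman chi-square, Kendall's W).

    scores: list of rows, one per rated case; each row holds one score
    per model (hypothetical layout, assumed for this sketch).
    """
    n = len(scores)       # number of cases (blocks)
    k = len(scores[0])    # number of models (treatments)
    rank_rows = [average_ranks(row) for row in scores]
    rank_sums = [sum(r[j] for r in rank_rows) for j in range(k)]

    # Tie correction: sum of (t^3 - t) over every group of tied scores in a row.
    ties = sum(t ** 3 - t for row in scores for t in Counter(row).values())
    correction = 1 - ties / (n * (k ** 3 - k))
    if correction == 0:
        raise ValueError("every row is fully tied; the statistic is undefined")

    chi2 = (12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3 * n * (k + 1)) / correction
    w = chi2 / (n * (k - 1))  # Kendall's W as an effect size, between 0 and 1
    return chi2, w


# Hypothetical scores for three models on four cases (illustrative only).
example = [
    [4.0, 4.0, 3.5],
    [4.5, 4.0, 4.0],
    [4.0, 3.5, 4.0],
    [4.0, 4.5, 4.0],
]
chi2, w = friedman_kendall_w(example)
print(f"Friedman chi-square = {chi2:.3f}, Kendall's W = {w:.3f}")
```

A W near 0 indicates that the rank ordering of the models is essentially inconsistent across cases, which is how the reported range of 0.003–0.029 is read as negligible. For the p-value, `scipy.stats.friedmanchisquare` computes the same tie-corrected statistic along with its chi-square p-value.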
