Impact of Query Language on Large Language Model Performance in Dental Trauma Management: A Comparative Evaluation of ChatGPT, Gemini, and Claude

Abstract

Background

Large language models (LLMs) are increasingly used as clinical decision support tools in healthcare, yet the impact of query language on their performance remains unclear, particularly in specialized domains such as dental traumatology. This study evaluated whether LLM performance in dental trauma management differs based on the language of clinical scenarios (English vs. Turkish) and compared performance across three AI models.

Methods

Twenty-seven clinical scenarios covering 13 dental trauma categories were presented to ChatGPT 5.2, Gemini 3.0, and Claude 4.5 Sonnet in both English and Turkish, generating 162 responses. Two blinded endodontists independently evaluated the responses against the IADT 2020 Guidelines using a standardized rubric assessing accuracy (40%), completeness (35%), and safety (25%). Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Language effects were analyzed using Wilcoxon signed-rank tests; model comparisons employed Kruskal-Wallis and Mann-Whitney U tests with Bonferroni correction.

Results

Inter-rater reliability was good across all dimensions (ICC: 0.738–0.836). ChatGPT showed the strongest language effect, with 9.14% higher performance in English (p < 0.001, r = 0.874). Gemini showed a moderate English advantage (5.69%, p = 0.003, r = 0.572). Claude exhibited language independence, with virtually identical performance in both languages (-0.02%, p = 0.220). In English, significant model differences emerged (H = 22.31, p < 0.001); in Turkish, however, model performance converged (H = 2.89, p = 0.236).

Conclusions

Language-dependent performance variations in LLMs are model-specific rather than universal. While ChatGPT achieved the highest absolute scores, Claude's language independence may offer more reliable performance in non-English clinical settings. These findings have implications for the deployment of AI in multilingual healthcare environments.
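The statistical workflow described in the Methods can be sketched as follows. This is a minimal illustration using `scipy.stats` on synthetic scores; the data, group sizes, and variable names are assumptions for demonstration only, not the study's data or analysis code.

```python
# Sketch of the comparisons described in Methods, on illustrative random data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 27  # scenarios per model per language (as in the study design)

# Paired per-scenario scores for one model: English vs. Turkish (synthetic)
english = rng.uniform(70, 100, n)
turkish = english - rng.uniform(0, 10, n)

# Language effect within a model: Wilcoxon signed-rank test (paired samples)
w_stat, w_p = stats.wilcoxon(english, turkish)

# Model comparison within one language: Kruskal-Wallis across three groups
chatgpt, gemini, claude = rng.uniform(60, 100, (3, n))
h_stat, h_p = stats.kruskal(chatgpt, gemini, claude)

# Post-hoc pairwise Mann-Whitney U tests with Bonferroni correction (3 pairs)
pairs = [(chatgpt, gemini), (chatgpt, claude), (gemini, claude)]
corrected = [min(stats.mannwhitneyu(a, b).pvalue * len(pairs), 1.0)
             for a, b in pairs]
print(w_p, h_p, corrected)
```

Non-parametric tests are appropriate here because rubric scores are bounded ordinal-like measures that need not be normally distributed; the Bonferroni factor (×3) controls the family-wise error rate across the three pairwise model comparisons.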
