Impact of Query Language on Large Language Model Performance in Dental Trauma Management: A Comparative Evaluation of ChatGPT, Gemini, and Claude

Abstract

Background

Large language models (LLMs) are increasingly used as clinical decision support tools in healthcare, yet the impact of query language on their performance remains unclear, particularly in specialized domains such as dental traumatology. This study evaluated whether LLM performance in dental trauma management differs based on the language of clinical scenarios (English vs. Turkish) and compared performance across three AI models.

Methods

Twenty-seven clinical scenarios covering 13 dental trauma categories were presented to ChatGPT 5.2, Gemini 3.0, and Claude 4.5 Sonnet in both English and Turkish, generating 162 responses. Two blinded endodontists independently evaluated the responses against the IADT 2020 Guidelines using a standardized rubric assessing accuracy (40%), completeness (35%), and safety (25%). Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Language effects were analyzed using Wilcoxon signed-rank tests; model comparisons employed Kruskal-Wallis and Mann-Whitney U tests with Bonferroni correction.

Results

Inter-rater reliability was good across all dimensions (ICC: 0.738–0.836). ChatGPT showed the strongest language effect, with 9.14% higher performance in English (p < 0.001, r = 0.874). Gemini showed a moderate English advantage (5.69%, p = 0.003, r = 0.572). Claude exhibited language independence, with virtually identical performance in both languages (-0.02%, p = 0.220). In English, significant model differences emerged (H = 22.31, p < 0.001); in Turkish, however, model performance converged (H = 2.89, p = 0.236).

Conclusions

Language-dependent performance variations in LLMs are model-specific rather than universal. While ChatGPT achieved the highest absolute scores, Claude's language independence may offer more reliable performance in non-English clinical settings. These findings have implications for the deployment of AI in multilingual healthcare environments.
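The statistical workflow described in the Methods can be sketched as follows. This is a minimal illustration using `scipy.stats` on synthetic scores; the data, group sizes, and variable names are assumptions for demonstration only, not the study's data or analysis code.

```python
# Sketch of the comparisons described in Methods, on illustrative random data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 27  # scenarios per model per language (as in the study design)

# Paired per-scenario scores for one model: English vs. Turkish (synthetic)
english = rng.uniform(70, 100, n)
turkish = english - rng.uniform(0, 10, n)

# Language effect within a model: Wilcoxon signed-rank test (paired samples)
w_stat, w_p = stats.wilcoxon(english, turkish)

# Model comparison within one language: Kruskal-Wallis across three groups
chatgpt, gemini, claude = rng.uniform(60, 100, (3, n))
h_stat, h_p = stats.kruskal(chatgpt, gemini, claude)

# Post-hoc pairwise Mann-Whitney U tests with Bonferroni correction (3 pairs)
pairs = [(chatgpt, gemini), (chatgpt, claude), (gemini, claude)]
corrected = [min(stats.mannwhitneyu(a, b).pvalue * len(pairs), 1.0)
             for a, b in pairs]
print(w_p, h_p, corrected)
```

Non-parametric tests are appropriate here because rubric scores are bounded ordinal-like measures that need not be normally distributed; the Bonferroni factor (×3) controls the family-wise error rate across the three pairwise model comparisons.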
