Clinical Safety of Large Language Models in Oral Cancer–Related Patient Communication: A Longitudinal Study

Abstract

Background: Oral cancer remains a major global health burden: it is among the most common malignancies worldwide and continues to cause significant morbidity and mortality. As patients increasingly consult large language models (LLMs) for health information before seeking professional evaluation, assessing the clinical safety and reliability of AI-generated responses in oral oncology has become essential.

Methods: This prospective longitudinal comparative study evaluated two advanced LLMs (Google Gemini Pro and xAI Grok-1) over a 7-day period. Twenty standardized Turkish-language oral cancer–related patient scenarios were submitted daily to each model, yielding 280 responses in total. Two independent oral and maxillofacial radiologists rated scientific accuracy and completeness on a 5-point Likert scale. Objective readability was measured using validated Turkish formulas (Ateşman and Bezirci–Yılmaz), and referral safety was evaluated as a binary outcome. Temporal stability was assessed using Cronbach’s alpha, and inter-model agreement was analyzed using intraclass correlation coefficients (ICC(2,1)).

Results: Mean scientific accuracy scores were 3.52 ± 0.57 for Gemini and 3.39 ± 0.68 for Grok (p = 0.072); completeness scores were 3.40 ± 0.70 and 3.25 ± 0.78, respectively (p = 0.091). Grok generated significantly longer sentences (14.83 ± 1.16 vs. 12.61 ± 0.49 words; p = 0.0005), although overall readability indices did not differ significantly. Referral-safe responses accounted for 90.0% of Gemini and 92.1% of Grok outputs (p = 0.536). Temporal reliability was high (Gemini α = 0.942; Grok α = 0.886), while inter-model agreement was moderate for scientific accuracy (ICC = 0.58) and completeness (ICC = 0.50).

Conclusions: Contemporary LLMs demonstrated moderate-to-high scientific accuracy and strong referral safety in oral cancer–related scenarios. While they appear to favor clinical caution over false reassurance, variability in linguistic structure and only moderate inter-model agreement highlight the need for clinician oversight. LLMs may serve as informational adjuncts but should not replace professional evaluation in suspected oral malignancies.
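The two core metrics named in the abstract can be sketched in code. The snippet below assumes the standard published forms: the Ateşman score as the Turkish adaptation of the Flesch formula (198.825 − 40.175 × syllables/word − 2.610 × words/sentence) and Cronbach's alpha in its usual item-variance form, applied here with the 7 daily repetitions as "items". The function names, the vowel-count syllable heuristic, and the sentence splitter are illustrative assumptions, not the authors' actual pipeline.

```python
import re
import numpy as np

# Turkish syllable count is approximated by counting vowels (one vowel per syllable).
TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def atesman_readability(text: str) -> float:
    """Ateşman readability (Turkish Flesch adaptation):
    198.825 - 40.175 * (syllables per word) - 2.610 * (words per sentence).
    Higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    syllables = sum(1 for ch in text if ch in TURKISH_VOWELS)
    return (198.825
            - 40.175 * (syllables / len(words))
            - 2.610 * (len(words) / len(sentences)))

def cronbach_alpha(scores: np.ndarray) -> float:
    """Temporal stability across repeated days.
    scores: (n_scenarios, k_days) matrix of Likert ratings.
    alpha = k/(k-1) * (1 - sum of per-day variances / variance of row totals)."""
    k = scores.shape[1]
    day_vars = scores.var(axis=0, ddof=1)          # variance of each day's ratings
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of per-scenario totals
    return k / (k - 1) * (1 - day_vars.sum() / total_var)
```

As a sanity check, a model that returns identical ratings on all seven days yields alpha = 1.0, matching the intuition behind the high stability values (α = 0.942 and 0.886) reported above.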
