Clinical Safety of Large Language Models in Oral Cancer–Related Patient Communication: A Longitudinal Study

Abstract

Background: Oral cancer remains a major global health burden: it is among the most common malignancies worldwide and continues to cause significant morbidity and mortality. As patients increasingly consult large language models (LLMs) for health information before seeking professional evaluation, assessing the clinical safety and reliability of AI-generated responses in oral oncology has become essential.

Methods: This prospective longitudinal comparative study evaluated two advanced LLMs (Google Gemini Pro and xAI Grok-1) over a 7-day period. Twenty standardized Turkish-language oral cancer–related patient scenarios were submitted daily to each model, yielding 280 responses in total. Two independent oral and maxillofacial radiologists rated scientific accuracy and completeness on a 5-point Likert scale. Objective readability was measured using validated Turkish formulas (Ateşman and Bezirci–Yılmaz), and referral safety was evaluated as a binary outcome. Temporal stability was assessed using Cronbach’s alpha, and inter-model agreement was analyzed using intraclass correlation coefficients (ICC(2,1)).

Results: Mean scientific accuracy scores were 3.52 ± 0.57 for Gemini and 3.39 ± 0.68 for Grok (p = 0.072); completeness scores were 3.40 ± 0.70 and 3.25 ± 0.78, respectively (p = 0.091). Grok generated significantly longer sentences (14.83 ± 1.16 vs. 12.61 ± 0.49 words; p = 0.0005), although overall readability indices did not differ significantly. Referral-safe responses accounted for 90.0% of Gemini and 92.1% of Grok outputs (p = 0.536). Temporal reliability was high (Gemini α = 0.942; Grok α = 0.886), while inter-model agreement was moderate for scientific accuracy (ICC = 0.58) and completeness (ICC = 0.50).

Conclusions: Contemporary LLMs demonstrated moderate-to-high scientific accuracy and strong referral safety in oral cancer–related scenarios. While they appear to favor clinical caution over false reassurance, variability in linguistic structure and only moderate inter-model agreement highlight the need for clinician oversight. LLMs may serve as informational adjuncts but should not replace professional evaluation in suspected oral malignancies.
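The two core metrics named in the abstract can be sketched in code. The snippet below assumes the standard published forms: the Ateşman score as the Turkish adaptation of the Flesch formula (198.825 − 40.175 × syllables/word − 2.610 × words/sentence) and Cronbach's alpha in its usual item-variance form, applied here with the 7 daily repetitions as "items". The function names, the vowel-count syllable heuristic, and the sentence splitter are illustrative assumptions, not the authors' actual pipeline.

```python
import re
import numpy as np

# Turkish syllable count is approximated by counting vowels (one vowel per syllable).
TURKISH_VOWELS = set("aeıioöuüAEIİOÖUÜ")

def atesman_readability(text: str) -> float:
    """Ateşman readability (Turkish Flesch adaptation):
    198.825 - 40.175 * (syllables per word) - 2.610 * (words per sentence).
    Higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    syllables = sum(1 for ch in text if ch in TURKISH_VOWELS)
    return (198.825
            - 40.175 * (syllables / len(words))
            - 2.610 * (len(words) / len(sentences)))

def cronbach_alpha(scores: np.ndarray) -> float:
    """Temporal stability across repeated days.
    scores: (n_scenarios, k_days) matrix of Likert ratings.
    alpha = k/(k-1) * (1 - sum of per-day variances / variance of row totals)."""
    k = scores.shape[1]
    day_vars = scores.var(axis=0, ddof=1)          # variance of each day's ratings
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of per-scenario totals
    return k / (k - 1) * (1 - day_vars.sum() / total_var)
```

As a sanity check, a model that returns identical ratings on all seven days yields alpha = 1.0, matching the intuition behind the high stability values (α = 0.942 and 0.886) reported above.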
