Comparative efficacy of ChatGPT-5.1 Auto and DeepSeek-V3.1 Thinking in answering patients’ questions on cervical spine surgery
Abstract
Background: This study aimed to evaluate and compare the performance of two artificial intelligence (AI) large language models, ChatGPT-5.1 Auto and DeepSeek-V3.1 Thinking, in generating responses to common patient questions regarding cervical spine surgery. The assessment focused on the quality of the generated content for pre-operative education across five key dimensions: accuracy, clarity, completeness, consistency, and readability.

Methods: Twenty frequently asked questions concerning cervical spine surgery were identified through a review of Google search trends. Identical queries were submitted to both AI models. Responses were evaluated independently by five experienced spine surgeons using a 5-point Likert scale covering the first four quality dimensions. Inter-rater reliability was determined via the intraclass correlation coefficient (ICC). Text readability was objectively analyzed using the Flesch–Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES) metrics. Mean scores for each criterion were compared between models using the Wilcoxon signed-rank test, with statistical significance set at P < 0.05.

Results: Both models demonstrated high performance in accuracy, clarity, and consistency. A significant difference was observed in completeness: ChatGPT-5.1 Auto provided more comprehensive responses (mean completeness 4.65 vs. 4.15, P < 0.001). DeepSeek-V3.1 Thinking had a lower FKGL (7.80 vs. 10.60, P < 0.001) and a higher FRES (61.70 vs. 39.50, P < 0.001), indicating that its text was easier for a general audience to read. ICC analysis indicated a high degree of inter-rater agreement (composite ICC = 0.869).

Conclusion: Both ChatGPT-5.1 Auto and DeepSeek-V3.1 Thinking can provide accurate, clear, and consistent information for pre-operative education of cervical spine surgery patients. While ChatGPT-5.1 Auto excelled in completeness, DeepSeek-V3.1 Thinking demonstrated superior readability. Leveraging the combined advantages of both models could optimize the effectiveness of AI-assisted pre-operative patient education.
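For context, the FKGL and FRES metrics reported above follow the standard Flesch formulas, which depend only on word, sentence, and syllable counts. A minimal sketch (the example counts are illustrative, not the study's data; the study may have used a dedicated readability tool rather than code like this):

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate U.S. school grade
    needed to understand the text (higher = harder)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fres(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score on a 0-100 scale (higher = easier;
    60-70 is commonly interpreted as plain English)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Illustrative counts only: a 100-word passage in 5 sentences
# averaging 1.5 syllables per word.
print(round(fkgl(100, 5, 150), 2))
print(round(fres(100, 5, 150), 2))
```

Under these formulas, shorter sentences and fewer syllables per word lower FKGL and raise FRES, which is consistent with the direction of the differences reported between the two models.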