Comparative efficacy of ChatGPT-5.1 Auto and DeepSeek-V3.1 Thinking in answering patients’ questions on cervical spine surgery
Abstract
Background: This study aimed to evaluate and compare the performance of two artificial intelligence (AI) large language models, ChatGPT-5.1 Auto and DeepSeek-V3.1 Thinking, in generating responses to common patient questions regarding cervical spine surgery. The assessment focused on the quality of the generated content for pre-operative education across five key dimensions: accuracy, clarity, completeness, consistency, and readability.

Methods: Twenty frequently asked questions concerning cervical spine surgery were identified through a review of Google search trends. Identical queries were submitted to both AI models. Responses were evaluated independently by five experienced spine surgeons using a 5-point Likert scale covering the first four quality dimensions. Inter-rater reliability was determined via the intraclass correlation coefficient (ICC). Text readability was objectively analyzed using the Flesch–Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES) metrics. Mean scores for each criterion were compared between models using the Wilcoxon signed-rank test, with statistical significance set at P < 0.05.

Results: Both models demonstrated high performance in accuracy, clarity, and consistency. A significant difference was observed in completeness: ChatGPT-5.1 Auto provided more comprehensive responses (mean completeness 4.65 vs. 4.15, P < 0.001). DeepSeek-V3.1 Thinking had a lower FKGL (7.80 vs. 10.60, P < 0.001) and a higher FRES (61.70 vs. 39.50, P < 0.001), indicating that its text was easier for a general audience to read. ICC analysis indicated a high degree of inter-rater agreement (composite ICC = 0.869).

Conclusion: Both ChatGPT-5.1 Auto and DeepSeek-V3.1 Thinking can provide accurate, clear, and consistent information for pre-operative education of cervical spine surgery patients. While ChatGPT-5.1 Auto excelled in completeness, DeepSeek-V3.1 Thinking demonstrated superior readability. Leveraging the combined advantages of both models could optimize the effectiveness of AI-assisted pre-operative patient education.
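For context, the FKGL and FRES metrics reported above follow the standard Flesch formulas, which depend only on word, sentence, and syllable counts. A minimal sketch (the example counts are illustrative, not the study's data; the study may have used a dedicated readability tool rather than code like this):

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate U.S. school grade
    needed to understand the text (higher = harder)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fres(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score on a 0-100 scale (higher = easier;
    60-70 is commonly interpreted as plain English)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Illustrative counts only: a 100-word passage in 5 sentences
# averaging 1.5 syllables per word.
print(round(fkgl(100, 5, 150), 2))
print(round(fres(100, 5, 150), 2))
```

Under these formulas, shorter sentences and fewer syllables per word lower FKGL and raise FRES, which is consistent with the direction of the differences reported between the two models.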