A Comparative Performance Analysis of AI-Assisted Language Models in Preoperative Patient Education for Mitral Valve Surgery

Banu Bahriye Akdağ
Mehmet Şenel Bademci
İhsan Peker
Okay Güven Karaca
Çağrı Kandemir
Barçın Özcem
Hüseyin Durmaz
Meryem Çakır
İrem Özçetin
Hidayet Onur Selçuk

Abstract

Background: Large language models (LLMs) supported by artificial intelligence (AI) are increasingly being used in patient education and information delivery within healthcare services. The aim of this study was to compare five LLMs (i.e., ChatGPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, DeepSeek-V3, and Microsoft Copilot) in terms of accuracy, completeness, and readability, based on their responses to frequently asked questions in preoperative patient education for mitral valve surgery (MVS).

Methods: A standardized questionnaire comprising seven questions frequently asked by patients prior to MVS was developed. These questions were presented to each LLM in an identical manner. The responses were evaluated by two academic experts in cardiac surgery using structured assessment criteria across three main dimensions: accuracy, completeness, and readability. For the readability analysis, the Simplified Measure of Gobbledygook (SMOG) Index and the Flesch-Kincaid Reading Ease (FRE) scale were used.

Results: The ChatGPT-4o and Gemini models received statistically significantly higher scores in terms of accuracy and completeness (p < 0.05), while the Claude 3.7 Sonnet model achieved the highest readability scores (p < 0.001). This model provided reader-friendly content using simpler and more comprehensible sentence structures. The Gemini and DeepSeek models demonstrated moderate performance, whereas the Microsoft Copilot model showed limitations in semantic coherence and medical specificity. Some models were found to provide misleading or incomplete information regarding surgical risks, the postoperative course, and potential complications.

Conclusions: LLMs represent valuable supplementary tools in patient education processes. However, their implementation in clinical practice must be carefully evaluated, particularly with regard to accuracy and completeness. This study highlights the potential applicability of the ChatGPT-4o and Claude models for preoperative patient education in MVS, while emphasizing that all LLMs should be used under the supervision and guidance of healthcare professionals. For LLMs to be reliably used in the medical field, improvements in medical accuracy and standardization are essential.
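The abstract names two readability measures but does not say which software computed them. As a rough illustration only, the sketch below applies the standard published formulas for the SMOG Index and for Flesch Reading Ease (presumably what the abstract abbreviates as FRE). The function names and the vowel-group syllable counter are assumptions made for this example and are not the study's method; dedicated readability tools count syllables more carefully.

```python
import math
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def smog_index(polysyllables: int, sentences: int) -> float:
    """SMOG grade: estimated years of schooling needed to understand the text.

    `polysyllables` is the number of words with three or more syllables.
    """
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def count_syllables(word: str) -> int:
    """Crude heuristic (assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_stats(text: str) -> dict:
    """Count sentences, words, syllables, and polysyllabic words in a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    return {
        "sentences": len(sentences),
        "words": len(words),
        "syllables": sum(syllables),
        "polysyllables": sum(1 for n in syllables if n >= 3),
    }


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from any model response.
    stats = text_stats("The valve is repaired. Recovery takes several weeks.")
    print(flesch_reading_ease(stats["words"], stats["sentences"], stats["syllables"]))
    print(smog_index(stats["polysyllables"], stats["sentences"]))
```

Note that the two scales run in opposite directions: a higher Flesch Reading Ease score means easier text, while a higher SMOG grade means harder text. SMOG was also designed for samples of about 30 sentences, so scores on very short responses should be read with caution.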

Version published to 10.21203/rs.3.rs-6965764/v1 on Research Square
Sep 9, 2025

Related articles

Risk Prediction in Spine Surgery: Traditional Models, Artificial Intelligence, and the Challenge of Clinical Translation

This article has 5 authors:
1. Samer Salman
2. Rohan Phadke
3. Rahul Kumar
4. Arbaz Momin
5. Alireza Tavakkoli
This article has no evaluations. Latest version Jan 8, 2026

Artificial Intelligence in Clinical Practice: Evaluating Chatbot Performance on Board-Level Questions in Geriatrics

This article has 2 authors:
1. Mert Zure
2. Metin Sökmen
This article has no evaluations. Latest version Jan 21, 2026

Benchmarking large language models for cardiovascular risk stratification using clinical vignettes

This article has 11 authors:
1. José Ferreira Santos
2. Regina Brito Duarte
3. Inês Mota
4. Rita Carvalheira Santos
5. José Maria Moreira
6. Joana Campos
7. Nuno André Silva
8. Bernardo Neves
9. Ricardo Ladeiras-Lopes
10. Francisca Leite
11. Helder Dores
This article has no evaluations. Latest version Dec 30, 2025
