A Comparative Performance Analysis of AI-Assisted Language Models in Preoperative Patient Education for Mitral Valve Surgery
Abstract
Background: Large language models (LLMs) supported by artificial intelligence (AI) are increasingly used for patient education and information delivery in healthcare services. This study compared five LLMs (ChatGPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, DeepSeek-V3, and Microsoft Copilot) in terms of accuracy, completeness, and readability, based on their responses to frequently asked questions in preoperative patient education for mitral valve surgery (MVS).
Methods: A standardized questionnaire comprising seven questions frequently asked by patients prior to MVS was developed, and each question was presented to every LLM in an identical manner. Two academic experts in cardiac surgery evaluated the responses using structured assessment criteria across three main dimensions: accuracy, completeness, and readability. For the readability analysis, the Simplified Measure of Gobbledygook (SMOG) Index and the Flesch-Kincaid Reading Ease (FRE) scale were used.
Results: The ChatGPT-4o and Gemini models received significantly higher scores for accuracy and completeness (p < 0.05), while the Claude 3.7 Sonnet model achieved the highest readability scores (p < 0.001), providing reader-friendly content with simpler and more comprehensible sentence structures. The Gemini and DeepSeek models demonstrated moderate performance, whereas the Microsoft Copilot model showed limitations in semantic coherence and medical specificity. Some models provided misleading or incomplete information regarding surgical risks, the postoperative course, and potential complications.
Conclusions: LLMs are valuable supplementary tools in patient education. However, their implementation in clinical practice must be carefully evaluated, particularly with regard to accuracy and completeness. This study highlights the potential applicability of the ChatGPT-4o and Claude models for preoperative patient education in MVS, while emphasizing that all LLMs should be used under the supervision and guidance of healthcare professionals. For LLMs to be used reliably in the medical field, improvements in medical accuracy and standardization are essential.
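The abstract does not say which tool was used for the readability analysis described in the Methods. As an illustration only, the two indices can be sketched from their standard published formulas: Flesch Reading Ease = 206.835 - 1.015 (words/sentences) - 84.6 (syllables/words), and SMOG = 1.0430 * sqrt(polysyllabic words * 30 / sentences) + 3.1291. The helper names and the vowel-group syllable counter below are assumptions made for this sketch, not the authors' method; dedicated readability software counts syllables more carefully and will give somewhat different values.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups and
    drop one for a trailing silent 'e'. Never returns less than 1."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def split_sentences(text: str) -> list[str]:
    """Split on runs of '.', '!' or '?' and discard empty fragments."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text: str) -> list[str]:
    """Extract alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[A-Za-z']+", text)


def flesch_reading_ease(text: str) -> float:
    """Higher scores indicate easier text."""
    sentences = split_sentences(text)
    words = split_words(text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def smog_index(text: str) -> float:
    """Estimated school grade needed to understand the text.
    Polysyllabic words are those with three or more syllables."""
    sentences = split_sentences(text)
    words = split_words(text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * (30 / len(sentences))) + 3.1291


if __name__ == "__main__":
    # Hypothetical sample answer, not taken from the study.
    answer = ("Your valve will be repaired or replaced. "
              "Most patients stay in hospital for about one week.")
    print(round(flesch_reading_ease(answer), 1), round(smog_index(answer), 1))
```

Under this scheme, a response with shorter sentences and fewer polysyllabic words yields a higher Flesch Reading Ease score and a lower SMOG grade, which is the direction the study reports as more readable. Note that SMOG was designed for samples of about 30 sentences, so scores for very short answers are only indicative.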