Evaluating ChatGPT-4o’s Web-Enhanced Responses in Patient Education: Ankle Stabilization Surgery as a Case Study
Abstract
Background: Artificial intelligence (AI) is increasingly used in healthcare for patient education, clinical decision support, and medical information dissemination. ChatGPT-4o, an advanced large language model (LLM) with web search functionality, is intended to deliver more accurate and relevant responses. However, its reliability and its ability to provide comprehensive, evidence-based medical information remain uncertain. This study evaluates the quality, readability, and accuracy of ChatGPT-4o’s responses to common patient inquiries regarding ankle stabilization surgery.

Methods: On January 30, 2025, ChatGPT-4o was prompted with frequently asked questions about ankle stabilization surgery, with the web search function enabled to enhance response accuracy. Three independent reviewers assessed the AI-generated responses using the DISCERN tool for quality and the Flesch Reading Ease and Flesch-Kincaid Grade Level metrics for readability. Inter-rater reliability was calculated to determine rating consistency.

Results: The readability analysis revealed that ChatGPT-4o’s responses were highly complex (Flesch Reading Ease score = 23.07; Flesch-Kincaid Grade Level = 13.9), requiring a college-level education for comprehension. DISCERN scores ranged from 47 to 58, indicating moderate quality. An inter-rater reliability score of 0.73 demonstrated substantial agreement. Limitations included overly optimistic recovery timelines, a lack of authoritative citations, and insufficient discussion of surgical risks.

Conclusion: ChatGPT-4o provides structured, well-organized medical information but exhibits limitations in accuracy, transparency, and risk disclosure. Future improvements should focus on enhancing source reliability, readability, and personalization to strengthen AI-assisted patient education.
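For readers unfamiliar with the DISCERN instrument, the following is a minimal sketch of how a total score such as the 47 to 58 range reported above is tallied, assuming the standard 16-item version with each item rated 1 to 5 (total range 16 to 80). The band labels reflect a commonly used interpretation from the wider DISCERN literature, not thresholds taken from this study.

```python
# Minimal sketch of DISCERN score tallying, assuming the standard
# 16-item instrument with each item rated 1-5 (total range 16-80).
# The band labels below are a commonly used interpretation and are
# an assumption, not drawn from this study.
def discern_total(item_scores: list[int]) -> int:
    assert len(item_scores) == 16 and all(1 <= s <= 5 for s in item_scores)
    return sum(item_scores)

def discern_band(total: int) -> str:
    if total >= 63: return "excellent"
    if total >= 51: return "good"
    if total >= 39: return "fair"
    if total >= 27: return "poor"
    return "very poor"

total = discern_total([3] * 10 + [4] * 6)  # hypothetical item ratings
print(total, discern_band(total))          # 54 good
```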
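The two readability figures reported above follow the published Flesch formulas. The sketch below implements them directly; the vowel-run syllable counter is a crude stand-in for the dictionary-based counting that dedicated tools (e.g., the textstat package) perform, so its outputs will differ somewhat from the study's.

```python
# Minimal sketch of the two readability formulas reported in the abstract:
# Flesch Reading Ease and Flesch-Kincaid Grade Level.
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels; floor at 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # average words per sentence
    spw = syllables / len(words)   # average syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

ease, grade = readability("The surgeon reattaches the ligament with sutures.")
print(f"Reading Ease: {ease:.1f}, Grade Level: {grade:.1f}")
```

A Reading Ease score of 23.07 falls in the hardest band of the scale, which is consistent with the reported Grade Level of 13.9 (roughly second-year college).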
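The abstract does not name the agreement statistic behind the 0.73 figure. Assuming Fleiss' kappa, a common choice for three or more raters whose 0.61 to 0.80 band is labeled "substantial" on the Landis-Koch scale, the computation might look like the following sketch; the ratings are illustrative, not the study's data.

```python
# Hedged sketch: assumes Fleiss' kappa as the inter-rater statistic,
# since the abstract does not specify one. Rows are rated items,
# columns are the three reviewers, values are illustrative 1-5 scores.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [4, 4, 3],
    [3, 3, 3],
    [5, 4, 4],
    [2, 3, 2],
    [4, 4, 4],
])

table, _ = aggregate_raters(ratings)  # items x categories count table
print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")
```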