ChatGPT Provides High-Quality Responses to Patient Questions: A Multi-Rater Evaluation by Anesthesiology Experts

Abstract

Background: This study aimed to evaluate the quality and reliability of responses generated by ChatGPT-4.0 to frequently asked patient questions, using expert ratings by anesthesiology and reanimation specialists.

Methods: A total of 22 common patient questions were submitted to ChatGPT-4.0. The responses were independently evaluated by five anesthesiology and reanimation specialists using a 4-point Likert-type scale (1 = excellent, 4 = unsatisfactory). Inter-rater reliability was assessed using the intraclass correlation coefficient, ICC(2,1).

Results: Of the 110 total ratings, 61.8% were classified as "excellent," 32.7% as "satisfactory requiring minimal clarification," and 5.5% as "satisfactory requiring moderate clarification." No responses were rated "unsatisfactory." Per-question average scores ranged from 1.0 to 2.4, and per-reviewer average scores ranged from 1.27 to 1.73. Overall inter-rater agreement was poor to fair (ICC = 0.25).

Conclusion: ChatGPT-4.0 produced responses to patient questions that medical specialists rated as high quality. However, the low inter-rater agreement underscores the importance of expert oversight when AI tools are used in clinical communication.
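For readers unfamiliar with the reliability statistic named in the Methods, the sketch below shows how an ICC(2,1) (two-way random effects, absolute agreement, single rater) can be computed from a questions-by-raters score matrix using the Shrout and Fleiss (1979) mean-square formulation. The data here are randomly generated for illustration only and are not the study's ratings; the function name and matrix shape (22 questions by 5 raters) are assumptions made to mirror the described design.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_targets, k_raters) matrix,
    e.g. 22 questions x 5 raters (illustrative shape, not the study data).
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-question means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares (Shrout & Fleiss, 1979)
    ss_rows = k * np.sum((row_means - grand_mean) ** 2)
    ss_cols = n * np.sum((col_means - grand_mean) ** 2)
    ss_total = np.sum((ratings - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Single-rater, absolute-agreement form of the ICC
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Illustrative data only: 22 questions rated by 5 raters on the 1-4 scale.
rng = np.random.default_rng(0)
scores = rng.integers(1, 5, size=(22, 5))
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```

Values near 0.25, as reported in the abstract, are conventionally interpreted as poor-to-fair agreement, whereas values above roughly 0.75 indicate good to excellent agreement.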
