ChatGPT Provides High-Quality Responses to Patient Questions: A Multi-Rater Evaluation by Anesthesiology Experts
Abstract
Background: This study aimed to evaluate the quality and reliability of responses generated by ChatGPT-4.0 to frequently asked patient questions, using expert ratings by anesthesiology and reanimation specialists.

Methods: A total of 22 common patient questions were submitted to ChatGPT-4.0. The responses were independently evaluated by five anesthesiology and reanimation specialists using a 4-point Likert-type scale (1 = Excellent, 4 = Unsatisfactory). Inter-rater reliability was assessed using the intraclass correlation coefficient, ICC(2,1).

Results: Of the 110 total ratings, 61.8% were classified as "excellent," 32.7% as "satisfactory requiring minimal clarification," and 5.5% as "satisfactory requiring moderate clarification." No responses were rated as "unsatisfactory." Mean scores per question ranged from 1.0 to 2.4, and mean scores per reviewer ranged from 1.27 to 1.73. Overall inter-rater agreement was poor to fair, with an ICC of 0.25.

Conclusion: ChatGPT-4.0 produced responses to patient questions that anesthesiology specialists rated as high quality. However, the low inter-rater agreement underscores the importance of expert oversight when using AI tools in clinical communication.
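For readers who wish to reproduce the reliability analysis on their own rating data, the following is a minimal sketch of how an ICC(2,1) estimate (two-way random effects, single rater, absolute agreement) can be computed for a 22-question by 5-rater design using the pingouin library in Python. The ratings shown are randomly generated placeholders, not the study's data, and the column names (question, rater, score) are illustrative assumptions.

    # Sketch: ICC(2,1) for a 22-question x 5-rater Likert rating design.
    # Placeholder data only; substitute the actual ratings matrix.
    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(0)
    n_questions, n_raters = 22, 5

    # Simulated Likert scores (1 = Excellent ... 4 = Unsatisfactory).
    scores = rng.integers(1, 5, size=(n_questions, n_raters))

    # Long format: one row per (question, rater) pair.
    df = pd.DataFrame(
        [
            {"question": q, "rater": r, "score": scores[q, r]}
            for q in range(n_questions)
            for r in range(n_raters)
        ]
    )

    icc = pg.intraclass_corr(
        data=df, targets="question", raters="rater", ratings="score"
    )
    # The row labeled "ICC2" is the two-way random-effects, single-rater,
    # absolute-agreement estimate, i.e. ICC(2,1).
    print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])

In pingouin's output, the "ICC2" row corresponds to the ICC(2,1) form reported in the abstract; values near 0.25 would fall in the poor-to-fair agreement range under common interpretive guidelines.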