ChatGPT Provides High-Quality Responses to Patient Questions: A Multi-Rater Evaluation by Anesthesiology Experts
Abstract
Background: This study aimed to evaluate the quality and reliability of responses generated by ChatGPT-4.0 to frequently asked patient questions, using expert ratings by anesthesiology and reanimation specialists.

Methods: A total of 22 common patient questions were submitted to ChatGPT-4.0. The responses were independently evaluated by five anesthesiology and reanimation specialists using a 4-point Likert-type scale (1 = Excellent, 4 = Unsatisfactory). Inter-rater reliability was assessed using the intraclass correlation coefficient, ICC(2,1).

Results: Of the 110 total ratings, 61.8% were classified as "excellent," 32.7% as "satisfactory requiring minimal clarification," and 5.5% as "satisfactory requiring moderate clarification." No responses were rated as "unsatisfactory." Mean scores per question ranged from 1.0 to 2.4, and mean scores per reviewer ranged from 1.27 to 1.73. Overall inter-rater agreement was poor to fair, with an ICC of 0.25.

Conclusion: ChatGPT-4.0 produced responses to patient questions that anesthesiology specialists rated as high quality. However, the low inter-rater agreement underscores the importance of expert oversight when using AI tools in clinical communication.
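For readers who wish to reproduce the reliability analysis on their own rating data, the following is a minimal sketch of how an ICC(2,1) estimate (two-way random effects, single rater, absolute agreement) can be computed for a 22-question by 5-rater design using the pingouin library in Python. The ratings shown are randomly generated placeholders, not the study's data, and the column names (question, rater, score) are illustrative assumptions.

    # Sketch: ICC(2,1) for a 22-question x 5-rater Likert rating design.
    # Placeholder data only; substitute the actual ratings matrix.
    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(0)
    n_questions, n_raters = 22, 5

    # Simulated Likert scores (1 = Excellent ... 4 = Unsatisfactory).
    scores = rng.integers(1, 5, size=(n_questions, n_raters))

    # Long format: one row per (question, rater) pair.
    df = pd.DataFrame(
        [
            {"question": q, "rater": r, "score": scores[q, r]}
            for q in range(n_questions)
            for r in range(n_raters)
        ]
    )

    icc = pg.intraclass_corr(
        data=df, targets="question", raters="rater", ratings="score"
    )
    # The row labeled "ICC2" is the two-way random-effects, single-rater,
    # absolute-agreement estimate, i.e. ICC(2,1).
    print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])

In pingouin's output, the "ICC2" row corresponds to the ICC(2,1) form reported in the abstract; values near 0.25 would fall in the poor-to-fair agreement range under common interpretive guidelines.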