Mixed Methods Assessment of ChatGPT Accuracy and Reliability in Healthcare Queries by Medical Residents
Abstract
Background Artificial intelligence (AI) is gaining traction in healthcare, with ChatGPT used for medical information retrieval. However, there is limited real-world evidence on how medical professionals in regions such as India perceive and use these tools.

Objective This study evaluates the accuracy and reliability of ChatGPT in addressing healthcare-related queries while exploring its advantages and limitations in medical practice.

Methods A mixed-methods study was conducted with 34 residents from 17 specialties at a medical college. Participants rated ChatGPT's responses to five specialty-specific questions on a six-point Likert scale. Intra-rater reliability was tested by repeating the queries after 10–15 days; inter-rater reliability was assessed via peer assessments within specialties. Additionally, semi-structured interviews with 7 residents explored the perceived benefits and limitations of ChatGPT in clinical settings. The study was conducted in December 2024. A clinical trial number is not applicable.

Results The participation response rate was 33.3% (34 of 102 resident doctors). Across the 170 medical queries assessed, ChatGPT's responses had a median accuracy score of 5.5, between "almost completely correct" and "completely correct." Binary questions scored slightly higher (median 6) than descriptive ones (median 5). Reliability was strong, with an intra-rater intraclass correlation coefficient (ICC) of 0.82 and an inter-rater ICC of 0.79. Qualitative findings highlighted two themes: ChatGPT's utility in clinical research, diagnosis, and education; and ethical concerns, including medico-legal risks, occasional inaccuracies, limited handling of complex cases, and the risk of over-reliance.

Conclusion ChatGPT shows high accuracy and reliability on healthcare queries, especially factual ones, but ethical concerns and limitations require ongoing human oversight. Continued improvement of AI tools is needed for safer, wider clinical use.
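The reliability analysis reports intraclass correlation coefficients for repeated and peer ratings. The abstract does not state which ICC model the authors used, so the following is only an illustrative sketch: it computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from a subjects-by-raters matrix of Likert scores, using the standard ANOVA mean-squares decomposition. The function name and example data are hypothetical, not taken from the study.

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: n_subjects x k_raters matrix of scores (e.g. 1-6 Likert values).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject (per-query) means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual error

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: 5 queries scored twice (e.g. an initial rating and a
# repeat rating 10-15 days later, as in the intra-rater design).
scores = np.array([[6.0, 5.0],
                   [5.0, 5.0],
                   [6.0, 6.0],
                   [4.0, 5.0],
                   [6.0, 6.0]])
print(round(icc2_1(scores), 2))
```

In practice a validated implementation (for example `pingouin.intraclass_corr`) would be preferable to hand-rolled ANOVA code, since it also reports confidence intervals and all six ICC variants.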