Chatbot Rating Scale (CHARS): Development and Validation of a New Tool for Assessing the Quality of Health and Social Care Chatbots

Abstract

Background Chatbots are increasingly deployed in healthcare, yet there is a dearth of standardized tools for evaluating their quality. The objective of this study was to develop and validate a novel, multidimensional instrument, the Chatbot Rating Scale (CHARS), for assessing the comprehensive quality of health and social care chatbots. Methods This study used a sequential mixed-methods design to guide the systematic development and validation of CHARS. The scale was conceptualized through a rigorous Delphi process involving domain experts, information technology experts, graduate students, and end users. To test and validate the scale, 979 respondents, comprising health professionals, researchers, mental health experts, and end users, rated 36 health and mental health chatbots using CHARS, a 5-point Likert-type scale. Ten chatbots were pilot-tested before the 36 chatbots were rated. Factorial validity was examined using parallel analysis, exploratory factor analysis (EFA), and confirmatory factor analysis (CFA), and internal consistency was assessed with Cronbach's alpha in R (version 4.4.2). Results The Delphi study yielded a refined scale comprising 25 items distributed across five initial dimensions. Empirical analysis supported a three-factor structure: (1) Design, (2) Interactive User Experience (personalization, engagement, and usability), and (3) Functionality, with excellent internal consistencies (Cronbach's alpha) of 0.90, 0.88, and 0.90, respectively. Factor loadings and intraclass correlation coefficients (ICCs) were high for all 25 items: standardized loadings ranged from 0.79 to 0.87 for Design (ICC = 0.88), from 0.77 to 0.87 for Interactive User Experience (ICC = 0.98), and from 0.83 to 0.87 for Functionality (ICC = 0.90); all loadings were statistically significant (p < .001).
All 25 Delphi-derived items were retained and redistributed across the three factors, preserving comprehensive coverage of chatbot quality dimensions. Conclusions CHARS is a novel, psychometrically robust, user-centred instrument that consolidates key experiential and functional attributes into a concise three-dimensional model. Its development highlights the convergence of user perceptions around integrated experiential qualities, offering a practical and theoretically grounded tool for evaluating chatbots. Future research should test its generalizability across cultural and technological contexts and examine predictive validity for user engagement.
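The internal-consistency statistic reported above, Cronbach's alpha, is computed from the per-item variances and the variance of respondents' summed scores. The authors conducted their analysis in R; the following is only an illustrative Python sketch of the alpha formula applied to a hypothetical toy matrix of 5-point ratings, not the study's actual code or data.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of ratings.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical ratings: 4 respondents x 2 items (illustration only)
ratings = np.array([[1, 2],
                    [2, 1],
                    [3, 4],
                    [4, 3]])
print(round(cronbach_alpha(ratings), 2))  # 0.75 for this toy matrix
```

Higher alpha indicates that the items within a factor co-vary strongly; values around 0.90, as reported for the three CHARS factors, are conventionally considered excellent.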
