An AI Score to Objectively Assess the Performance of Educational Chatbots
Abstract
The rapid integration of AI chatbots into education has created a need for objective methods to evaluate their pedagogical performance. This study introduces an AI Score, a composite metric designed to benchmark educational chatbots across four criteria: initial performance, robustness, self-correction ability, and reliability. The AI Score is calculated using a weighted formula and validated through a standardized test comprising highly discriminant multiple-choice questions. Using this test, six platforms (ChatGPT, Copilot Studio, NotebookLM, Grok, Mistral, and ClaudeAI) were evaluated under identical conditions with Retrieval-Augmented Generation and class-specific resources. The results demonstrate the AI Score's ability to differentiate chatbot performance. The methodology aligns with ISO/IEC standards for AI reliability and governance, offering educators a reproducible framework for pre-deployment assessment. Limitations and future directions, including longitudinal studies, qualitative evaluation of answer quality, and adaptation to other domains, are also discussed.
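As a rough illustration of the kind of composite metric the abstract describes, the sketch below computes a weighted score over the four criteria. The abstract does not specify the actual weights, scoring scales, or formula, so the criterion names, weights, and values here are hypothetical placeholders, not the authors' method.

```python
# Hypothetical sketch of a weighted composite score. The real AI Score's
# weights and per-criterion scales are not given in the abstract; the
# numbers below are illustrative assumptions only.

def ai_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores, each assumed in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total_weight

# Four criteria named in the abstract; weights are assumed for illustration.
weights = {"initial_performance": 0.4, "robustness": 0.2,
           "self_correction": 0.2, "reliability": 0.2}
scores = {"initial_performance": 0.85, "robustness": 0.70,
          "self_correction": 0.60, "reliability": 0.90}

print(round(ai_score(scores, weights), 2))  # -> 0.78
```

A weighted average (rather than a raw sum) keeps the composite on the same 0-to-1 scale as the individual criteria, which makes scores comparable across chatbots even if the weighting is later revised.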