An AI Score to Objectively Assess the Performance of Educational Chatbots

Abstract

The rapid integration of AI chatbots into education has created a need for objective methods to evaluate their pedagogical performance. This study introduces the AI Score, a composite metric designed to benchmark educational chatbots across four criteria: initial performance, robustness, self-correction ability, and unreliability. The AI Score is calculated using a weighted formula and validated through a standardized test comprising highly discriminating multiple-choice questions. To validate this test, six platforms (ChatGPT, Copilot Studio, NotebookLM, Grok, Mistral, and ClaudeAI) were evaluated under identical conditions using Retrieval-Augmented Generation and class-specific resources. The results demonstrate the AI Score's ability to differentiate chatbot performance. The methodology aligns with ISO/IEC standards for AI reliability and governance, offering educators a reproducible framework for pre-deployment assessment. Limitations and future directions, including longitudinal studies, qualitative evaluation of answer quality, and adaptation to other domains, are also discussed.
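
As a rough illustration of such a weighted composite, the sketch below combines four normalized criterion scores into a single value. The four criterion names come from the abstract; the specific weights, the [0, 1] scale, and the choice to subtract the unreliability term are hypothetical placeholders, not the values used in the study.

# Minimal sketch of a weighted composite "AI Score" (Python).
# Criteria follow the abstract; weights and scales are hypothetical.

CRITERIA_WEIGHTS = {            # hypothetical weights, summing to 1.0
    "initial_performance": 0.40,
    "robustness": 0.25,
    "self_correction": 0.20,
    "unreliability": 0.15,      # penalty term: higher values lower the score
}

def ai_score(scores: dict) -> float:
    """Combine per-criterion scores, each normalized to [0, 1], into one value.

    The unreliability term is subtracted rather than added, so a chatbot
    that contradicts itself or answers inconsistently is penalized.
    """
    total = 0.0
    for criterion, weight in CRITERIA_WEIGHTS.items():
        value = scores[criterion]
        if criterion == "unreliability":
            total -= weight * value
        else:
            total += weight * value
    return total

# Example: scoring one hypothetical chatbot on the same MCQ test.
chatbot_a = {
    "initial_performance": 0.85,
    "robustness": 0.70,
    "self_correction": 0.60,
    "unreliability": 0.10,
}
print(f"AI Score (chatbot A): {ai_score(chatbot_a):.3f}")

Treating unreliability as a penalty rather than a positively weighted criterion is one plausible reading of a formula that rewards performance while accounting for a "lack of reliability"; the published formula may aggregate the criteria differently.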
