An AI Score to Objectively Assess the Performance of Educational Chatbots
Abstract
The rapid integration of AI chatbots into education has created a need for objective methods to evaluate their pedagogical performance. This study introduces an AI Score, a composite metric designed to benchmark educational chatbots across four criteria: initial performance, robustness, self-correction ability, and reliability. The AI Score is calculated using a weighted formula and validated through a standardized test comprising highly discriminant multiple-choice questions. Using this test, six platforms (ChatGPT, Copilot Studio, NotebookLM, Grok, Mistral, and ClaudeAI) were evaluated under identical conditions with Retrieval-Augmented Generation and class-specific resources. The results demonstrate the AI Score's ability to differentiate chatbot performance. The methodology aligns with ISO/IEC standards for AI reliability and governance, offering educators a reproducible framework for pre-deployment assessment. Limitations and future directions, including longitudinal studies, qualitative evaluation of answer quality, and adaptation to other domains, are also discussed.
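As a rough illustration of the kind of composite metric the abstract describes, the sketch below computes a weighted score over the four criteria. The abstract does not specify the actual weights, scoring scales, or formula, so the criterion names, weights, and values here are hypothetical placeholders, not the authors' method.

```python
# Hypothetical sketch of a weighted composite score. The real AI Score's
# weights and per-criterion scales are not given in the abstract; the
# numbers below are illustrative assumptions only.

def ai_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores, each assumed in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total_weight

# Four criteria named in the abstract; weights are assumed for illustration.
weights = {"initial_performance": 0.4, "robustness": 0.2,
           "self_correction": 0.2, "reliability": 0.2}
scores = {"initial_performance": 0.85, "robustness": 0.70,
          "self_correction": 0.60, "reliability": 0.90}

print(round(ai_score(scores, weights), 2))  # -> 0.78
```

A weighted average (rather than a raw sum) keeps the composite on the same 0-to-1 scale as the individual criteria, which makes scores comparable across chatbots even if the weighting is later revised.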