Can ChatGPT Replace the Teacher in Assessment? A Review of Research on the Use of Large Language Models in Grading and Providing Feedback

Abstract

This article presents a systematic review of empirical research on the use of large language models (LLMs) for automatically grading student work and providing feedback. The study aimed to determine the extent to which generative artificial intelligence models, such as ChatGPT, can replace teachers in the assessment process. The review was conducted in accordance with PRISMA guidelines and predefined inclusion criteria; ultimately, 42 empirical studies were included in the analysis. The results indicate that the effectiveness of LLMs in grading varies by task type. These models perform well on closed-ended tasks and short-answer questions, often achieving accuracy comparable to that of human evaluators. However, they struggle with complex, open-ended, or subjective assignments that require in-depth analysis or creativity. The quality of the prompts provided to the model and the use of detailed scoring rubrics significantly influence the accuracy and consistency of LLM-generated grades. The findings suggest that LLMs can support teachers by accelerating the grading process and delivering rapid feedback at scale, but they cannot fully replace human judgment. The highest effectiveness is achieved in hybrid assessment systems that combine AI-driven automatic grading with teacher oversight and verification.