Exploring the Quality and Effectiveness of AI-Generated Feedback in Introductory Programming

Abstract

Feedback is a vital but often challenging component of introductory programming courses, where standard compiler messages are vague and confusing for students. Generative artificial intelligence (GenAI) has emerged as a promising tool for providing improved feedback in programming education, yet empirical studies of its effectiveness in real educational settings are limited. Using a design-based research approach, this study examined both the quality and the instructional impact of AI-generated feedback in an introductory Python programming course. Participants were two cohorts of undergraduate students: the first cohort received data-driven feedback (DDF Group, n = 28), while the second cohort, who used an upgraded automated assessment tool, received feedback generated by the Llama 3-8B model (AIF Group, n = 32). Quality was assessed through expert ratings on a 0–5 point rubric and through student perception surveys. Effectiveness was evaluated through debugging performance metrics and final exam scores. Expert evaluation of 1,490 AI-generated feedback messages revealed concerning quality issues, with a mean rating of 1.84 out of 5 and over 40% of messages receiving the lowest possible score. Common problems included excessive redundant information, content exceeding students' knowledge scope, and misleading explanations. Students reported significantly lower perceived usefulness for AI-generated feedback than for data-driven feedback. The AIF Group also exhibited poorer debugging performance and achieved lower final programming exam scores. Contrary to expectations, AI-generated feedback was less effective than data-driven feedback in supporting student learning. This study highlights the need for rigorous design, prompt refinement, and contextual alignment when deploying GenAI tools in educational contexts.
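
The abstract does not describe how the assessment tool prompts the model. Purely as an illustration of the kind of integration the study evaluates, the sketch below shows how an autograder might request beginner-level feedback on a failing submission from a locally hosted Llama 3-8B model via an OpenAI-compatible chat endpoint. The endpoint URL, model name, prompt wording, and generation settings are assumptions for illustration, not details taken from the study.

```python
# Illustrative sketch only: requesting feedback on a failed Python submission
# from a locally hosted Llama 3-8B model behind an OpenAI-compatible chat API.
# The URL, model name, and prompt are hypothetical, not the study's actual setup.
import requests

API_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local server
MODEL = "llama-3-8b-instruct"                           # hypothetical model identifier


def generate_feedback(student_code: str, error_message: str) -> str:
    """Ask the model for short, beginner-level feedback on a failing submission."""
    prompt = (
        "You are a teaching assistant for an introductory Python course.\n"
        "Explain the error below in plain language for a beginner, without "
        "giving the full corrected solution.\n\n"
        f"Student code:\n{student_code}\n\nError message:\n{error_message}"
    )
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # keep feedback relatively consistent across runs
        "max_tokens": 300,   # cap length to limit the redundancy the study reports
    }
    response = requests.post(API_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    code = "total = 0\nfor i in range(10)\n    total += i"
    error = "SyntaxError: expected ':'"
    print(generate_feedback(code, error))
```

The study's findings (redundant, out-of-scope, or misleading feedback) suggest that prompt constraints like the length cap and the "no full solution" instruction above are exactly the kind of design choices that need careful refinement and evaluation before classroom deployment.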
