Passing the Turing Test: Fine-tuned AI feedback is less detectable than human or prompt-engineered feedback

Abstract

As large language models (LLMs) become increasingly integrated into educational technologies, questions arise about the authenticity and pedagogical value of AI-generated feedback. This study investigates whether human participants can distinguish between feedback written by a human instructor and feedback generated by an AI model, and how the method of generation (prompt engineering vs. fine-tuning) affects this perception. One hundred participants completed a Turing-test-inspired task in which they evaluated feedback texts and indicated whether they believed each text was written by a human or by an AI model. The AI-generated feedback was produced either through prompt engineering or with a LLaMA 3.1 model fine-tuned using QLoRA. The results showed that participants correctly identified the prompt-engineered feedback as nonhuman in most cases, whereas identification of the fine-tuned model's feedback was indistinguishable from chance. These findings suggest that fine-tuning can significantly enhance the human-likeness of AI-generated feedback, with implications for the design of scalable, trustworthy feedback systems in education. The study contributes to ongoing discussions about the role of AI in supporting reflective learning and highlights the importance of transparency, trust, and pedagogical alignment in AI-mediated educational environments.
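For readers unfamiliar with the adaptation method named in the abstract, the following is a minimal sketch of QLoRA-style fine-tuning with the Hugging Face transformers and peft libraries. The model variant, hyperparameters, and training data are illustrative assumptions, not the authors' exact configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Assumed base model; the paper specifies LLaMA 3.1 but not the exact size/variant.
base_model = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit (NF4) quantization of the frozen base weights -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters that would be trained on pairs of student work and
# instructor-written feedback; only these adapter weights are updated.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

The quantized base model stays frozen, so the memory cost of adaptation is limited to the small adapter matrices, which is what makes this approach practical for adapting an LLM to a specific instructor's feedback style.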
