Passing the Turing Test: Fine-tuned AI feedback is less detectable than human or prompt-engineered feedback

Abstract

As large language models (LLMs) become increasingly integrated into educational technologies, questions arise about the authenticity and pedagogical value of AI-generated feedback. This study investigates whether human participants can distinguish between feedback written by a human instructor and feedback generated by an AI model, and how the method of generation (prompt engineering vs. fine-tuning) affects this perception. One hundred participants completed a Turing-test-inspired task in which they evaluated feedback texts and indicated whether they believed each text was written by a human or by an AI model. The AI-generated feedback was produced either through prompt engineering or with a LLaMA 3.1 model fine-tuned using QLoRA. The results showed that participants correctly identified the prompt-engineered feedback as nonhuman in most cases, whereas identification of the fine-tuned model's feedback was indistinguishable from chance. These findings suggest that fine-tuning can significantly enhance the human-likeness of AI-generated feedback, with implications for the design of scalable, trustworthy feedback systems in education. The study contributes to ongoing discussions about the role of AI in supporting reflective learning and highlights the importance of transparency, trust, and pedagogical alignment in AI-mediated educational environments.
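For readers unfamiliar with the adaptation method named in the abstract, the following is a minimal sketch of QLoRA-style fine-tuning with the Hugging Face transformers and peft libraries. The model variant, hyperparameters, and training data are illustrative assumptions, not the authors' exact configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Assumed base model; the paper specifies LLaMA 3.1 but not the exact size/variant.
base_model = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit (NF4) quantization of the frozen base weights -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters that would be trained on pairs of student work and
# instructor-written feedback; only these adapter weights are updated.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

The quantized base model stays frozen, so the memory cost of adaptation is limited to the small adapter matrices, which is what makes this approach practical for adapting an LLM to a specific instructor's feedback style.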
