Augmenting thesis supervision with generative AI feedback: Evaluating the quality and perceived usefulness of GenAI feedback


Abstract

Delivering timely, high-quality feedback on long-form academic student writing is hard to scale. We therefore developed a generative AI (GenAI) feedback tool for undergraduate medical theses. Feedback was based on 36 criteria, and feedback quality was evaluated on six aspects. We assessed feedback consistency across five repeated iterations and robustness across thesis types and grades, and compared GenAI feedback with human feedback. Finally, we examined students’ perceived usefulness and added value. Thirteen theses (9 systematic reviews, 2 narrative reviews, 2 empirical studies) were evaluated, yielding 14,040 feedback aspect ratings. Median GenAI feedback quality was 1.0 (IQR = 0.167, scale 0–1), with most criteria exceeding the set threshold of 0.8. Consistency exceeded 0.8 for four quality aspects (feedback, feedforward, thesis-specific, and criterion-based); the thesis-accurate and criterion-accurate aspects did not meet the threshold. Quality and consistency did not differ by thesis type or grade. Relative to 215 human feedback comments on the same theses, GenAI feedback showed higher median quality (1.00 vs. 0.33). During in-class sessions, students (n = 30) rated the feedback as useful, sufficient, and of added value (median 6–7, scale 1–7) and favored future use as a supplement to, rather than a replacement for, supervisor feedback. Limitations of the developed GenAI tool include difficulty determining thesis type and an inability to assess figures and cross-references across chapters, both of which reduce feedback quality. Overall, the findings support a hybrid model: GenAI offers rapid, comprehensive formative feedback, while educators provide contextual expertise, literature awareness, and mentorship. This division of labor expands access to actionable feedback at scale without displacing essential human judgment.
