Augmenting thesis supervision with generative AI feedback: Evaluating the quality and perceived usefulness of GenAI feedback

Remco Jongkind
Myrthe Heikens
Lotte Barmentloo
Erik Elings
Floor van der Steijle
Lisa-Maria van Klaveren

Abstract

Delivering timely, high-quality feedback on long-form academic student writing is hard to scale. Therefore, we developed a generative AI (GenAI) feedback tool for undergraduate medical theses. Feedback was based on 36 criteria, and feedback quality was evaluated on six aspects. We assessed feedback consistency across five repeated iterations and robustness across thesis types and grades, and compared GenAI feedback with human feedback. Finally, we examined students’ perceptions of the feedback's usefulness and added value. Thirteen theses (9 systematic reviews, 2 narrative reviews, 2 empirical studies) were evaluated, yielding 14,040 feedback aspect ratings. Median GenAI feedback quality was 1.0 (IQR = 0.167, scale 0–1), with most criteria exceeding the set threshold of 0.8. Consistency surpassed 0.8 for four quality aspects (feedback, feedforward, thesis-specific, and criterion-based); the thesis-accurate and criterion-accurate aspects did not meet the threshold. Quality and consistency did not differ by thesis type or grade. Relative to 215 human feedback comments on the same theses, GenAI feedback showed higher median quality (1.00 vs. 0.33). During in-class sessions, students (n = 30) rated the feedback as useful, sufficient, and of added value (median 6–7, scale 1–7) and favored future use as a supplement to, rather than a replacement for, supervisor feedback. Limitations of the developed GenAI tool include difficulty determining the thesis type and an inability to assess figures and cross-references across chapters, both of which reduced feedback quality. Overall, the findings support a hybrid model: GenAI offers rapid, comprehensive formative feedback, while educators provide contextual expertise, literature awareness, and mentorship. This division of labor expands access to actionable feedback at scale without displacing essential human judgment.
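The figures reported in the abstract can be checked with a small sketch. The total of 14,040 ratings is consistent with 13 theses × 36 criteria × 6 aspects × 5 iterations, although the abstract does not state this breakdown explicitly. The scoring rule below (quality as the fraction of six aspects met, each rated 0 or 1) is an assumption made for illustration only, and the example ratings are hypothetical, not data from the study.

```python
from statistics import median

# Counts taken from the abstract; their product matches the reported total.
THESES, CRITERIA, ASPECTS, ITERATIONS = 13, 36, 6, 5
total_ratings = THESES * CRITERIA * ASPECTS * ITERATIONS

# Threshold reported in the abstract.
THRESHOLD = 0.8


def quality_score(aspect_ratings):
    """Fraction of aspects rated as met (ASSUMED binary 0/1 ratings)."""
    return sum(aspect_ratings) / len(aspect_ratings)


# Hypothetical ratings for one criterion on one thesis:
# five iterations, six aspect ratings each.
example_iterations = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
]

scores = [quality_score(r) for r in example_iterations]
median_score = median(scores)
meets_threshold = median_score >= THRESHOLD

print(total_ratings, median_score, meets_threshold)
```

Under this assumed rule, a score step of one aspect is 1/6 ≈ 0.167 and two of six aspects met gives ≈ 0.33, which is in line with the IQR and the human-feedback median quoted above; the actual scoring procedure is described in the full article.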

Version published to 10.21203/rs.3.rs-8352038/v1 on Research Square
Feb 10, 2026

Related articles

The Unreliable Judges: Assessing Reproducibility and Self-Preference Bias of LLMs as Free-Text Evaluators

This article has 4 authors:
1. J I Alvarez-Arenas
2. D Jimenez-Carretero
3. D Mañanes
4. F Sanchez-Cabo
Latest version: Jun 17, 2026

Automating Screening of Titles and Abstracts in Systematic Reviews: An Assessment of GPT-4o mini

This article has 5 authors:
1. Mir Sohail Fazeli
2. Ellen Kasireddy
3. Mir-Masoud Pourrahmat
4. Cuthbert Chow
5. Jean-Paul Collet
Latest version: May 20, 2026

Practice what you preach: Designing student assignments that advance open and reproducible science

This article has 3 authors:
1. James Edward Bartlett
2. Gaby Mahrholz
3. Emily Nordmann
Latest version: May 14, 2026