Looks good on paper: LLM-generated exams are face-valid but psychometrically weaker than human assessments
Abstract
Practice exams support learning but are labor-intensive to create, prepare, and implement. Large language models (LLMs) may help to reduce this burden, but the quality of LLM-generated assessment items has varied in research to date. Across two studies, we evaluate PREPARE, a comprehensive LLM-based workflow aimed at high-quality multiple-choice question generation. In Study 1, we examined the perceived relative quality of LLM-generated items and found that university lecturers’ (N = 109) perceptions of LLM-generated item quality were broadly similar to their perceptions of human-generated items. Specifically, when lecturers were asked to select questions for a mock exam, the exams they assembled consisted of 45% LLM-generated questions. In Study 2, we then examined the relative psychometric integrity of LLM-generated items, with first-year undergraduate students (N = 336) completing a series of assessments containing both LLM-generated and instructor-generated items. In contrast to Study 1, we found that LLM-generated items appeared to be psychometrically weaker overall than their human-generated counterparts. Together, our results suggest a disconnect between perceived quality and psychometric performance: LLM-generated questions may look the part, but this appearance obscures psychometric deficits. LLM-based item generation may be useful for low-stakes practice, but further refinement is needed to achieve item quality equivalent to that of human instructors.