Looks good on paper: LLM-generated exams are face-valid but psychometrically weaker than human assessments

Abstract

Practice exams support learning but are labor-intensive to create, prepare, and implement. Large language models (LLMs) may help to reduce this burden, but the quality of LLM-generated assessment items has varied in research to date. Across two studies, we evaluate PREPARE, a comprehensive LLM-based workflow aimed at generating high-quality multiple-choice questions. In Study 1, we examined the perceived relative quality of LLM-generated items and found that university lecturers’ (N = 109) perceptions of LLM-generated item quality were broadly similar to their perceptions of human-generated items. Specifically, when asked to select questions for a mock exam, lecturers’ chosen exams consisted of 45% LLM-generated questions. In Study 2, we examined the relative psychometric integrity of LLM-generated items, with first-year undergraduate students (N = 336) completing a series of assessments containing both LLM-generated and instructor-generated items. In contrast to Study 1, LLM-generated items were, overall, psychometrically weaker than their human-generated counterparts. Together, our results suggest a disconnect between perceived quality and psychometric performance: LLM-generated questions may look the part, but this appearance can mask psychometric deficits. LLM-based item generation may be useful for low-stakes practice, but further refinement is needed to achieve item quality equivalent to that of human instructors.