Looks good on paper: LLM-generated exams are face-valid but psychometrically weaker than human assessments
Abstract
Practice exams support learning but are labor-intensive to create, prepare, and implement. Large language models (LLMs) may help to reduce this burden, but the quality of LLM-generated assessment items has varied in research to date. Across two studies, we evaluate PREPARE, a comprehensive LLM-based workflow aimed at high-quality multiple-choice question generation. In Study 1, we examined the perceived relative quality of LLM-generated items and found that university lecturers’ (N = 109) perceptions of LLM-generated item quality were broadly similar to their perceptions of human-generated items. Specifically, when lecturers were asked to select questions for a mock exam, the exams they assembled consisted of 45% LLM-generated questions. In Study 2, we then examined the relative psychometric integrity of LLM-generated items, with first-year undergraduate students (N = 336) completing a series of assessments containing both LLM-generated and instructor-generated items. In contrast to Study 1, we found that LLM-generated items appeared to be psychometrically weaker overall than their human-generated counterparts. Together, our results suggest a disconnect between perceived quality and psychometric performance: LLM-generated questions may look the part, but this appearance obscures psychometric deficits. LLM-based item generation may be useful for low-stakes practice, but further refinement is needed to achieve item quality equivalent to that of human instructors.