Psychometric Performance and Student Perceptions of AI- versus Human-Generated Multiple-Choice Questions: The AHEAD Randomized Controlled Trial
Abstract
Background
Developing high-quality multiple-choice examinations in medical education is time- and resource-intensive. Large language models (LLMs) offer a promising approach to accelerate question development; however, their utility for exam development remains underexplored.

Methods
The AHEAD Trial (AI vs Human Exam Assessment and Development) was a participant-blinded, parallel-group randomized controlled trial conducted among first-year medical students. Students were randomized to complete a 112-item case-based, single-best-answer mock examination composed of either AI-generated or human-generated multiple-choice questions (MCQs). Questions were developed using identical curricular objectives. AI-generated items were produced via a dual-model workflow (ChatGPT for generation; Google Gemini for validation); human-generated items were authored by senior medical students. Outcomes were evaluated using Van der Vleuten’s Assessment Utility Framework across feasibility, acceptability, reliability, validity, and educational impact. Primary analyses were conducted in the intention-to-treat (ITT) population using appropriate parametric or non-parametric tests, with effect sizes and 95% confidence intervals reported.

Results
A total of 258 students were randomized, with 127 allocated to the AI-generated exam arm and 131 to the human-generated exam arm. LLM-assisted MCQ development achieved a 5.6-fold efficiency gain compared with human authorship (4.2 ± 1.9 vs. 19.6 ± 7.5 minutes per item; p < 0.0001). Student perceptions of exam acceptability (clarity, difficulty, relevance, and educational value) were comparable between AI-generated and human-generated exams (all p > 0.05; effect sizes < 0.5). Human-generated items demonstrated slightly higher discrimination indices than AI-generated items, though the effect size was small, and distractor efficiency did not differ between protocols.
Student performance was marginally higher on the human-generated exam, though this difference was not significant in the ITT analysis. Exploratory analyses identified theme-specific performance variation and potential gender-related performance differences on the AI-generated exam. Neither exam meaningfully changed students’ perceived preparedness.

Conclusions
LLMs can substantially accelerate MCQ development while producing formative assessments that are psychometrically comparable and acceptable to learners. Although small differences persist, these findings support the integration of LLM-assisted item generation within a human-in-the-loop framework, combining AI efficiency with expert oversight to preserve psychometric quality and equity.

Trial registration
This study was retrospectively registered on ClinicalTrials.gov (identifier NCT07481162; registered March 18, 2026). Prospective registration was not performed because the study was conducted as an embedded educational intervention within a voluntary formative examination setting. The study protocol and statistical analysis plan were prespecified prior to data analysis. The trial is reported in accordance with CONSORT 2025 guidelines.
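
For readers less familiar with the item statistics named in the Results, the following Python sketch shows one common way a discrimination index and a distractor efficiency could be computed. The abstract does not state which definitions the trial used, so everything here is an assumption: the upper-lower group method with 27% tails, the 5% selection threshold for a "functional" distractor, and the function and parameter names are all illustrative rather than the authors' method.

```python
from typing import Sequence


def discrimination_index(
    item_correct: Sequence[int],
    total_scores: Sequence[float],
    fraction: float = 0.27,  # assumed tail size; a common convention
) -> float:
    """Upper-lower discrimination index for one item.

    Returns the proportion correct in the top-scoring group minus the
    proportion correct in the bottom-scoring group.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))  # examinees per tail group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


def distractor_efficiency(
    choices: Sequence[str],
    key: str,
    options: Sequence[str],
    threshold: float = 0.05,  # assumed cutoff for a functional distractor
) -> float:
    """Share of an item's distractors chosen by at least `threshold` of examinees."""
    distractors = [o for o in options if o != key]
    n = len(choices)
    functional = sum(
        1 for d in distractors
        if sum(c == d for c in choices) / n >= threshold
    )
    return functional / len(distractors)
```

Under these assumptions, an item answered correctly only by the higher-scoring examinees yields a discrimination index near 1, and an item with one never-chosen distractor out of three has a distractor efficiency of 2/3.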