Generating multiple-choice items for a B2 English reading test with GPT-4: targeting higher-order cognitive processing
Abstract
This study investigated the potential of generative AI to produce multiple-choice reading comprehension items for B2-level English assessment, with a focus on higher-order cognitive processing. GPT-4, configured within a custom environment, was used to generate 164 items from six authentic texts aligned with the official test specifications of the Escoles Oficials d’Idiomes (Catalonia). The items underwent expert review and were trialled with 775 test-takers. A triangulated approach combined linguistic analysis, expert judgements, psychometric modelling, and test-taker feedback. Findings showed that GPT-4 frequently attempted to target higher-order cognitive processing, but the resulting items were often misclassified and suffered from flaws such as implausible distractors and misinterpretation of the source texts. An item generation log revealed unstable model behaviour across generation rounds. Linguistic analysis of the item stems highlighted formulaic structures and GPT-4’s confusion about the cognitive processing required to complete each item. Expert reviewers confirmed that most items required substantial revision, with distractor plausibility and construct alignment as recurrent concerns. Psychometric indices indicated that the items exhibited acceptable model fit and discrimination but were generally easy for the trial group. The study concludes that GenAI can replicate the surface features of items targeting higher-order cognitive processing but rarely provides substantive coverage of complex reading processes.
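The abstract reports difficulty and discrimination findings from the trial without detailing the psychometric model (the reference to "model fit" suggests an IRT-family analysis, which is not reproduced here). Purely as an illustration of what such indices capture, the sketch below computes classical facility (proportion correct) and corrected point-biserial discrimination for a simulated 0/1 response matrix; the 20 items, their parameter values, and the latent-ability simulation are all invented, with only the 775-taker sample size taken from the study.

```python
import numpy as np

def item_indices(responses: np.ndarray) -> dict:
    """Classical item-analysis indices for a 0/1-scored response matrix.

    responses: shape (n_test_takers, n_items), 1 = correct answer.
    Returns per-item facility (proportion correct) and corrected
    point-biserial discrimination (item score vs. rest-of-test score).
    """
    n_items = responses.shape[1]
    facility = responses.mean(axis=0)            # high value = easy item
    totals = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = totals - responses[:, j]          # exclude the item itself
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return {"facility": facility, "discrimination": discrimination}

# Illustrative run on simulated data: 775 test-takers (as in the trial),
# 20 hypothetical items made deliberately easy via a simple latent-ability model.
rng = np.random.default_rng(0)
ability = rng.normal(0.0, 1.0, size=(775, 1))
difficulty = rng.normal(-1.5, 0.5, size=(1, 20))   # negative = easy items
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
sim = (rng.random((775, 20)) < p_correct).astype(int)

out = item_indices(sim)
print(out["facility"].round(2))        # high values (around .80) flag easy items
print(out["discrimination"].round(2))  # positive values indicate discrimination
```

Under these assumptions, a pattern like the one the study reports would appear as uniformly high facility values alongside adequate positive discrimination, i.e. items that separate stronger from weaker readers but pose little challenge to the trial group overall.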