Can Large Language Models Emulate Human Performance on Educational Assessments?
Abstract
Large language models (LLMs) are increasingly used to generate synthetic data for research and evaluation, including simulating student responses to educational assessments. However, their validity for emulating authentic human performance remains unclear. This study evaluates whether LLMs can generate assessment data that preserve the score distribution, subgroup relationships, item functioning, latent-variable model properties, and overall distributional similarity required for valid measurement. Using fourth-grade mathematics data from TIMSS 2011, we compare synthetic responses generated by GPT-4o and GPT-5 under multiple prompting conditions against empirical student responses and against theory-driven data simulated from a cognitive diagnostic model (CDM). Across both architectures, LLM-generated data exhibit systematic score inflation and variance compression, with distortions more pronounced for GPT-5. Item easiness is overestimated, item discriminations are attenuated, and alignment with empirical item properties is weak. Adding information on cognitive mastery profiles and personal background modestly increases response variability but does not recover a human-like psychometric structure. Although the CDM shows adequate fit to some LLM-generated datasets, item parameter recovery remains poor. In contrast, CDM-based simulations closely reproduce human score distributions, item properties, and latent structure. These results suggest that LLM-generated synthetic assessment data deviate systematically from human data, limiting their usefulness for inferences that require structural fidelity. Caution is therefore warranted when using LLM-generated synthetic data in measurement contexts where variance, item functioning, and latent structure are central.