Can Large Language Models Emulate Human Performance on Educational Assessments?
Abstract
Large language models (LLMs) are increasingly used to generate synthetic data for research and evaluation, including simulating student responses to educational assessments. However, their validity for emulating authentic human performance remains unclear. This study evaluates whether LLMs can generate assessment data that preserve the score distribution, subgroup relationships, item functioning, latent-variable model properties, and overall distributional similarity required for valid measurement. Using fourth-grade mathematics data from TIMSS 2011, we compare synthetic responses generated by GPT-4o and GPT-5 under multiple prompting conditions against empirical student responses and against theory-driven data simulated from a cognitive diagnostic model (CDM). Across both architectures, LLM-generated data exhibit systematic score inflation and variance compression, with distortions more pronounced for GPT-5. Item easiness is overestimated, item discriminations are attenuated, and alignment with empirical item properties is weak. Adding information on cognitive mastery profiles and personal background modestly increases response variability but does not recover a human-like psychometric structure. Although the CDM shows adequate fit to some LLM-generated datasets, item parameter recovery remains poor. In contrast, CDM-based simulations closely reproduce human score distributions, item properties, and latent structure. These results suggest that LLM-generated synthetic assessment data deviate systematically from human data, limiting their usefulness for inferences that require structural fidelity. Caution is therefore warranted when using LLM-generated synthetic data in measurement contexts where variance, item functioning, and latent structure are central.