Evaluating Large Language Models for Psychometric Simulation Studies in R: Integrating Best Practices in Simulation and Prompt Design

Mohammed A. A. Abulela
Ethan C. Brown
Guher Gorgun
Kyle Nickodem
Nana Kim
Justin Kern

Abstract

While the capabilities of large language models (LLMs) have been widely investigated across domains, limited research has examined their ability to generate end-to-end R code for simulation studies in psychometrics. To address this gap, we evaluated four commonly used reasoning models: ChatGPT-5.1 (Thinking), Claude 4.5 Opus, DeepSeek-V3.2, and Gemini 3 Pro. Each model generated simulation code corresponding to two published educational measurement studies, one involving Factor Analysis (FA) and another involving Item Response Tree (IRTree) models. Incorporating modern best practices in simulation design, we assessed model performance on code quality (through expert code review), executability (through actually running the simulations), and output quality (through assessing completeness and comparing results to the reference study). While all models generated plausible, well-structured code, relative success depended heavily on the simulation context. Notably, overall model performance did not differ significantly in the FA study, with most models performing well. In the IRTree study, however, the generated code had more fundamental quality issues. Regarding LLM evaluation criteria, model estimation and performance measures consistently received lower ratings than code structure and data generation tasks. When we compared the simulation results to those in the reference study, we discovered additional issues that the expert code reviewers had not seen, such as ways that most LLMs “cheated” by falling back on inappropriate methods that ran without errors but were incorrect (e.g., using simple imputation rather than expectation-maximization), underscoring the challenges of assessing LLM code quality even for psychometric experts. Nonetheless, the relative success on these overall tasks suggests that LLMs can be effective for generating R code for psychometric simulations when paired with structured prompting (i.e., following the ADEMP framework), effective testing, and strict human oversight.

Version published to 10.31234/osf.io/9tfej_v1 on OSF Preprints
Apr 8, 2026

Related articles

Can Large Language Models Emulate Human Performance on Educational Assessments?

This article has 4 authors:
1. Xiuxiu Tang
2. Yikai Lu
3. John T. Behrens
4. Ying Cheng
This article has no evaluations. Latest version: Apr 23, 2026

Designing Digital Affordances Against the Syntax Barrier: A Blended Learning Design Framework for Computational Thinking Development in Secondary ICT Education

This article has 6 authors:
1. Galiya Saltanova
2. Balgyn Akhmetova
3. Dinara Yesmagambetova
4. Bakhytgul Kazhykenova
5. Meruert Yermukhambetova
6. Jaroslav Kultan
This article has no evaluations. Latest version: Apr 17, 2026

ARPG+: Teaching Students to Ask Effective Questions for Educational LLM Use

This article has 6 authors:
1. Pei-Gen Ye
2. Kanghua Mo
3. Yucheng Long
4. Mengyun Liu
5. Haiwei Sang
6. Jun Zheng
This article has no evaluations. Latest version: Apr 15, 2026
