Evaluating Large Language Models for Psychometric Simulation Studies in R: Integrating Best Practices in Simulation and Prompt Design

Abstract

While large language model (LLM) capabilities have been widely investigated across various domains, limited research has examined their ability to generate end-to-end R code for conducting simulation studies in psychometrics. To address this gap, we evaluated four commonly used reasoning models (ChatGPT-5.1 [Thinking], Claude 4.5 Opus, DeepSeek-V3.2, and Gemini 3 Pro) in generating simulation code corresponding to two published educational measurement studies, one involving Factor Analysis (FA) and another involving Item Response Tree (IRTree) models. Incorporating modern best practices in simulation design, we assessed model performance across code quality (through expert code review), executability (by running the simulations), and output quality (by assessing completeness and comparing results to the reference study). While all models generated plausible, well-structured code, relative success depended heavily on the simulation context. Notably, overall model performance did not differ significantly in the FA study, with most models performing well. In the IRTree study, however, the generated code showed more fundamental quality issues. Across the LLM evaluation criteria, model estimation and performance measures consistently received lower ratings than code structure and data generation tasks. When we compared the simulation results to the results in the reference study, we discovered additional issues that the expert code reviewers had missed, such as ways that most LLMs “cheated” by falling back on inappropriate methods that were error-free but incorrect (e.g., using simple imputation rather than expectation-maximization), underscoring the challenges of assessing LLM code quality even for psychometric experts. Nonetheless, the models' relative success on these tasks overall suggests that LLMs can be effective for generating R code for psychometric simulations when paired with structured prompting (i.e., following the ADEMP framework), effective testing, and strict human oversight.
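To make the ADEMP structure concrete (Aims, Data-generating mechanisms, Estimands, Methods, Performance measures), the following is a minimal base-R sketch of a one-factor simulation. It is not code from the study: the loadings, sample size, number of replications, and the helper names generate_data and fit_one_factor are all assumptions chosen for illustration. Estimation failures are recorded as missing values and reported through a convergence rate rather than silently replaced by a fallback method.

```r
# Aims: check how well one-factor ML factor analysis recovers known loadings.
set.seed(2024)  # fixed seed so the run is reproducible

# Data-generating mechanism (illustrative values, not taken from the study)
true_loadings <- c(0.8, 0.7, 0.6, 0.5, 0.7, 0.6)
n_obs  <- 300   # respondents per simulated data set
n_reps <- 200   # Monte Carlo replications

generate_data <- function(n, loadings) {
  eta       <- rnorm(n)                  # latent factor scores
  unique_sd <- sqrt(1 - loadings^2)      # gives each item unit variance
  noise     <- matrix(rnorm(n * length(loadings)), nrow = n)
  outer(eta, loadings) + sweep(noise, 2, unique_sd, `*`)
}

# Estimand: the loading vector. Method: maximum likelihood via stats::factanal.
fit_one_factor <- function(x) {
  tryCatch({
    est <- as.numeric(unclass(factanal(x, factors = 1)$loadings))
    if (sum(est) < 0) est <- -est        # resolve sign indeterminacy
    est
  }, error = function(e) rep(NA_real_, ncol(x)))  # record failure, no fallback
}

estimates <- t(replicate(
  n_reps,
  fit_one_factor(generate_data(n_obs, true_loadings))
))

# Performance measures: bias and its Monte Carlo standard error, per item
converged <- complete.cases(estimates)
ok        <- estimates[converged, , drop = FALSE]
bias      <- colMeans(ok) - true_loadings
mcse_bias <- apply(ok, 2, sd) / sqrt(nrow(ok))

results <- data.frame(
  item      = seq_along(true_loadings),
  true      = true_loadings,
  bias      = bias,
  mcse_bias = mcse_bias
)
convergence_rate <- mean(converged)

print(results)
print(convergence_rate)
```

Reporting the convergence rate alongside bias is one way to surface the kind of quiet substitution described above: a replication that fails stays visibly failed instead of being estimated by an easier but incorrect method.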
