LLMs in the Lab: Can AI Predict What Real Participants Do?
Abstract
Can large language models (LLMs) simulate participant-level datasets from experimental designs whose statistical properties (effect directions, magnitudes, and significance) align with those of actual human data? In this work, we tested whether LLMs can generate simulated datasets that reproduce the core findings of real randomized controlled trials (RCTs) using only the information provided in a study’s pre-registration. We assessed whether this alignment generalizes across different LLMs (ChatGPT, Gemini, Perplexity) and across distinct experimental domains, including a math reasoning task comparing student performance and a social judgment task. The LLM-simulated datasets mirrored the real data in effect direction and recovered the original patterns of statistical significance. All models correctly reproduced the direction of the human effects, though effect magnitudes varied by model: Gemini consistently overestimated effects, while Perplexity aligned most closely with the human data. While LLMs cannot replace empirical studies, they offer a powerful and flexible complement capable of accelerating idea testing, refining study designs, and probing the robustness of research findings before real-world experiments are run.
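The comparison the abstract describes, checking whether a simulated dataset matches real data in effect direction and magnitude, can be sketched as follows. This is a minimal illustration, not the authors' analysis pipeline: the participant scores, group labels, and the use of Cohen's d as the magnitude measure are all assumptions made for the example.

```python
# Sketch: compare an LLM-simulated two-arm RCT dataset to human data
# on effect direction and standardized magnitude (Cohen's d).
# All numbers below are illustrative, not taken from the study.
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference; the sign gives the effect direction."""
    n1, n2 = len(treatment), len(control)
    pooled_sd = (((n1 - 1) * stdev(treatment) ** 2 +
                  (n2 - 1) * stdev(control) ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd


# Hypothetical per-participant scores (e.g., accuracy on a math task)
human_treat = [0.71, 0.68, 0.74, 0.80, 0.66]
human_ctrl = [0.55, 0.60, 0.52, 0.58, 0.61]
sim_treat = [0.78, 0.82, 0.75, 0.85, 0.79]   # LLM-simulated treatment arm
sim_ctrl = [0.50, 0.54, 0.49, 0.57, 0.52]    # LLM-simulated control arm

d_human = cohens_d(human_treat, human_ctrl)
d_sim = cohens_d(sim_treat, sim_ctrl)

# Direction match is the weakest alignment criterion; magnitude can still
# diverge (the paper reports Gemini overestimating effect sizes).
same_direction = (d_human > 0) == (d_sim > 0)
print(f"human d={d_human:.2f}, simulated d={d_sim:.2f}, "
      f"same direction: {same_direction}")
```

In this toy example the simulated effect points the same way as the human one but is larger, the kind of magnitude inflation the abstract attributes to Gemini.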