Using large language models as a source of human behavioral data in social science experiments


Abstract

Large language models (LLMs) have prompted proposals to replace human subjects in social science experiments with simulated responses. Empirical evaluations suggest that this practice, often called silicon sampling, can sometimes approximate human behavior but is unreliable. We delineate where silicon sampling may still provide value and where it may not, but primarily study an alternative approach, in which model-based predictions are used not as substitutes for human data but as auxiliary measurements within randomized experiments. We formalize the inference of causal estimands from mixed-subjects randomized controlled trials, in which outcomes are observed for a subset of units while predictions are available for all units. Under transparent design conditions, we derive a family of estimators that remain unbiased for the average treatment effect in finite samples while exploiting predictions to reduce variance. We characterize when prediction-powered, calibration-based, arm-specifically tuned, and difference-in-predictions estimators improve precision, and we provide a software package that operationalizes these results and helps researchers jointly select estimators and allocate budgets between human data collection and prediction generation. Together, our results show how generative artificial intelligence can improve experimental social science without compromising scientific validity.
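To make the mixed-subjects idea concrete, the following is a minimal sketch of one prediction-powered estimator of the kind the abstract describes: for each arm, the mean of model predictions over all units is bias-corrected using the labeled subset where human outcomes were actually observed. The function names and data layout here are illustrative assumptions, not the paper's actual software package.

```python
import numpy as np

def ppi_arm_mean(y_labeled, f_labeled, f_all):
    """Prediction-powered estimate of one arm's mean outcome.

    y_labeled: observed human outcomes on the labeled subset
    f_labeled: model predictions for those same labeled units
    f_all:     model predictions for all units in the arm
    """
    # Mean of predictions over all units, plus a bias correction
    # estimated from the labeled subset; under random labeling this
    # is unbiased regardless of prediction quality.
    return f_all.mean() + (y_labeled - f_labeled).mean()

def ppi_ate(y1, f1_lab, f1_all, y0, f0_lab, f0_all):
    """Prediction-powered average treatment effect (treated minus control)."""
    return ppi_arm_mean(y1, f1_lab, f1_all) - ppi_arm_mean(y0, f0_lab, f0_all)
```

If the predictions are accurate, the correction term is small and the estimator's variance shrinks toward that of the full-sample prediction mean; if they are poor, the labeled-subset correction still removes the bias, at the cost of less variance reduction.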