Synthetic Participants Generated by Large Language Models: A Systematic Literature Review
Abstract
In recent years, the prospect of using Large Language Models (LLMs) to simulate participants in various research and data collection methods has been examined extensively. Proponents cite aspirational promises, including high flexibility, adaptability, better representation and reduced research costs, all by leveraging the encoded wisdom of the internet crowd. Empirical studies paint a more nuanced but fragmented picture, with mixed results, heterogeneous methods and a saturation of different perspectives. In this systematic literature review, we delineate a clear and comprehensive conceptual understanding of LLM-generated participants and their comparative relationship to human samples. We synthesize the findings from 182 studies, obtained through a hybrid database and reference search, followed by a rigorous quality curation. Grounded in generalizable indicators, we present a standardized categorization of four fundamental issues that impact synthetic participants across diverse types of simulations – cognitive misalignments, distortions, misleading believability, and overfitting/contamination. Although the review reveals that studies integrate different LLMs, prompt engineering techniques, and participant or environment modeling methods, the fidelity improvements they demonstrate remain modest. At their most representative, LLMs may stochastically parrot data they were pre-trained on or fine-tuned with. To set appropriate expectations, explain their limitations and inform future applications, we propose framing synthetic participants as heuristic-like. Additionally, we discuss evaluation measures, specific supplemental roles for which synthetic participants can be valid, the underexplored potential of augmentative approaches, and critical professional, social and ethical considerations surrounding simulated insights.