Synthetic Participants Generated by Large Language Models: A Systematic Literature Review

Eduard Kuric
Peter Demcak
Matus Krajcovic

Abstract

In recent years, the prospect of using Large Language Models (LLMs) to simulate participants within various research and data collection methods has been interrogated extensively. Proponents cite aspirational promises, including high flexibility, adaptability, better representation and reduced research costs, all by leveraging the encoded wisdom of the internet crowd. Empirical studies paint a more nuanced but fragmented picture, with mixed results, heterogeneous methods and a saturation of different perspectives. In this systematic literature review, we delineate a clear and comprehensive conceptual understanding of LLM-generated participants and their comparative relationship to human samples. We synthesize the findings from 182 studies, obtained through a hybrid database and reference search, followed by a rigorous quality curation. Grounded in generalizable indicators, we present a standardized categorization of four fundamental issues that impact synthetic participants across diverse types of simulations – cognitive misalignments, distortions, misleading believability, and overfitting/contamination. Although the survey reveals integrations of different LLMs, prompt engineering techniques, and participant or environment modeling methods, the fidelity improvements they demonstrate remain modest. At their most representative, LLMs may stochastically parrot data they were pre-trained on or fine-tuned with. To set appropriate expectations, explain their limitations and inform future applications, we propose the framing of synthetic participants as heuristic-like. Additionally, we discuss evaluation measures, specific supplemental roles that synthetic participants can be valid for, the underexplored potential of augmentative approaches, as well as critical professional, social and ethical considerations surrounding simulated insights.

Version published to 10.21203/rs.3.rs-9057643/v1 on Research Square
Mar 10, 2026

Related articles

Augmenting Large Language Models with External Data Sources: A Systematic Review of Methodologies, Performance Metrics, and Information Fidelity

This article has 4 authors:
1. Soham Mukherjee
2. John Le
3. Chau Nguyen
4. Thai Vu
This article has no evaluations
Latest version Apr 10, 2026

Trial and Insight: Combining Quantitative Content Analysis and AI for Experimental Stimulus Generation

This article has 4 authors:
1. Yannick Winkler
2. Pablo Jost
3. Nils Schwager
4. Pascal Jürgens
This article has no evaluations
Latest version Mar 4, 2026

Uses and Misuses of Large Language Models in Qualitative Research

This article has 1 author:
1. Jonathan Ben-Menachem
This article has no evaluations
Latest version Mar 17, 2026