Simulating Lay Health-Seeking Behavior with LLM Personas and Illness Vignettes: Reproducibility, Prompt Sensitivity, and Slice Dependence

Abstract

Large language models (LLMs) are increasingly used as “synthetic respondents” to simulate human judgments and decision-making. In healthcare-adjacent settings, a key methodological risk is that simulated behavior may be sensitive to prompt framing, run-to-run stochasticity, and the slice of scenarios being tested (e.g., red-flag vs. non–red-flag situations). We present a fully synthetic, non-human-subject study in which an LLM, prompted with a layperson persona, chooses a next action in response to an illness vignette, using a fixed action codebook (A0–A9). In a Pilot experiment (40 persona–scenario pairs; 2 prompt variants; 3 repeats), the model produced plausibly monotonic action urgency as vignette severity increased and showed moderate run-to-run agreement (mean agreement 0.617). However, prompt comparisons performed within the same batch yielded perfect agreement between prompts (0/40 mismatches), indicating that within-batch paired designs can underestimate prompt sensitivity. In an isolated-prompt audit (24 pairs), the action mismatch rate between prompts varied substantially across runs (0.0% to 45.8%). Prompt sensitivity was slice-dependent: the mismatch rate was low in mild non–red-flag scenarios (8.3%) but high in red-flag scenarios (41.7%). A stress test using a stronger rubric shifted the action distribution (JS divergence 0.130) and reduced mean urgency by 1.29 points. These findings motivate multi-run, slice-aware evaluations when using LLM personas to simulate health-seeking behavior.
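The abstract's quantitative comparisons rest on two statistics: a between-prompt action mismatch rate over paired persona–scenario items, and a Jensen–Shannon divergence between action distributions on the A0–A9 codebook. The following is a minimal sketch of how such statistics could be computed; the function names, the base-2 logarithm, and the smoothing constant are illustrative assumptions, not taken from the paper.

import numpy as np
from collections import Counter

ACTIONS = [f"A{i}" for i in range(10)]  # fixed action codebook A0–A9

def action_distribution(choices):
    # Empirical probability over the A0–A9 codebook for one condition (prompt/run).
    counts = Counter(choices)
    total = sum(counts.values())
    return np.array([counts.get(a, 0) / total for a in ACTIONS])

def js_divergence(p, q, eps=1e-12):
    # Jensen–Shannon divergence (base 2) between two action distributions.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mismatch_rate(choices_a, choices_b):
    # Fraction of paired persona–scenario items whose chosen action differs between prompts.
    return float(np.mean([a != b for a, b in zip(choices_a, choices_b)]))

# Example (hypothetical data):
# js_divergence(action_distribution(run_a), action_distribution(run_b))
# mismatch_rate(run_a, run_b)

Under these assumptions, a JS divergence near 0.13, as reported for the stress test, would reflect a modest but non-trivial shift in the action distribution rather than near-identical behavior.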
