Simulating Lay Health-Seeking Behavior with LLM Personas and Illness Vignettes: Reproducibility, Prompt Sensitivity, and Slice Dependence


Abstract

Large language models (LLMs) are increasingly used as “synthetic respondents” to simulate human judgments and decision-making. In healthcare-adjacent settings, a key methodological risk is that simulated behavior may be sensitive to prompt framing, stochastic decoding, and the scenario slice being tested (e.g., red-flag vs non–red-flag situations). We present a fully synthetic, non-human-subject methodological audit in which an LLM, conditioned on a fictional layperson persona, selects a next-action code (A0–A9) for an illness vignette. In a Pilot experiment (40 persona–scenario pairs; two prompt variants; three repeats), action urgency increased with vignette severity and repeatability was moderate (mean modal agreement 0.617). However, within-batch paired prompt comparisons yielded perfect agreement (0/40 mismatches), suggesting that paired designs that do not enforce independence can severely underestimate prompt sensitivity. In an isolated-prompt audit (24 pairs; three repeats), prompt mismatch varied widely across replications (0.0% to 45.8%). To disentangle prompt effects from decoding noise, we performed a controlled follow-up rerun on two slices (non–red-flag and red-flag; 24 pairs each) under explicit decoding settings (temperature 0 vs default temperature 1.0/top-p 0.95) with 8–10 repeats per condition. Prompt sensitivity remained high under both decoding regimes (mean action mismatch 0.787–0.821; mean Jensen–Shannon divergence 0.148–0.196 bits), while near-deterministic decoding improved run-to-run stability in three of four settings. A rubric stress test shifted the action distribution (Jensen–Shannon divergence 0.130) and reduced mean urgency by 1.29 points. Together, these results motivate multi-run, slice-aware, decoding-controlled evaluations when using LLM personas for behavioral simulation.
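The abstract reports two repeatability and sensitivity metrics: modal agreement across repeated runs, and Jensen–Shannon divergence (in bits) between action distributions under different prompt variants. As an illustration of how such quantities are conventionally computed (the exact implementation used in the study is not shown here; function names are ours), a minimal sketch:

```python
import math
from collections import Counter

def jsd_bits(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    using base-2 logarithms so the result is in bits (range [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def modal_agreement(actions):
    """Fraction of repeated runs that selected the modal (most frequent)
    action code, e.g. ["A2", "A2", "A5"] -> 2/3."""
    counts = Counter(actions)
    return counts.most_common(1)[0][1] / len(actions)
```

For example, two prompt variants that always produce disjoint actions yield the maximum divergence of 1.0 bit, while identical action distributions yield 0.0, so the reported per-slice means of 0.148–0.196 bits sit well inside this range.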