Simulating Lay Health-Seeking Behavior with LLM Personas and Illness Vignettes: Reproducibility, Prompt Sensitivity, and Slice Dependence


Abstract

Large language models (LLMs) are increasingly used as “synthetic respondents” to simulate human judgments and decision-making. In healthcare-adjacent settings, a key methodological risk is that simulated behavior may be sensitive to prompt framing, stochastic decoding, and the scenario slice being tested (e.g., red-flag vs non–red-flag situations). We present a fully synthetic, non-human-subject methodological audit in which an LLM, conditioned on a fictional layperson persona, selects a next-action code (A0–A9) for an illness vignette. In a Pilot experiment (40 persona–scenario pairs; two prompt variants; three repeats), action urgency increased with vignette severity and repeatability was moderate (mean modal agreement 0.617). However, within-batch paired prompt comparisons yielded perfect agreement (0/40 mismatches), suggesting that paired designs that do not enforce independence can severely underestimate prompt sensitivity. In an isolated-prompt audit (24 pairs; three repeats), prompt mismatch varied widely across replications (0.0% to 45.8%). To disentangle prompt effects from decoding noise, we performed a controlled follow-up rerun on two slices (non–red-flag and red-flag; 24 pairs each) under explicit decoding settings (temperature 0 vs default temperature 1.0/top-p 0.95) with 8–10 repeats per condition. Prompt sensitivity remained high under both decoding regimes (mean action mismatch 0.787–0.821; mean Jensen–Shannon divergence 0.148–0.196 bits), while near-deterministic decoding improved run-to-run stability in three of four settings. A rubric stress test shifted the action distribution (Jensen–Shannon divergence 0.130) and reduced mean urgency by 1.29 points. Together, these results motivate multi-run, slice-aware, decoding-controlled evaluations when using LLM personas for behavioral simulation.
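The abstract reports two repeatability and sensitivity metrics: modal agreement across repeated runs, and Jensen–Shannon divergence (in bits) between action distributions under different prompt variants. As an illustration of how such quantities are conventionally computed (the exact implementation used in the study is not shown here; function names are ours), a minimal sketch:

```python
import math
from collections import Counter

def jsd_bits(p, q):
    """Jensen-Shannon divergence between two discrete distributions,
    using base-2 logarithms so the result is in bits (range [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def modal_agreement(actions):
    """Fraction of repeated runs that selected the modal (most frequent)
    action code, e.g. ["A2", "A2", "A5"] -> 2/3."""
    counts = Counter(actions)
    return counts.most_common(1)[0][1] / len(actions)
```

For example, two prompt variants that always produce disjoint actions yield the maximum divergence of 1.0 bit, while identical action distributions yield 0.0, so the reported per-slice means of 0.148–0.196 bits sit well inside this range.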