Algorithmic Authority: How Large Language Models Instantiate the Stanford Prison Experiment
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Background: The Stanford Prison Experiment (SPE) demonstrated how situational forces and assigned roles can override individual dispositions to produce harmful behaviors. Despite extensive research on human role conformity, no studies have examined whether large language models (LLMs) exhibit similar role-based behavioral shifts when assigned authority positions, raising critical questions about AI safety as these systems are increasingly deployed in contexts involving power asymmetries. Objective: To determine whether LLMs demonstrate systematic behavioral changes when assigned guard versus prisoner roles in a simulated prison environment, and whether individual differences in persona traits moderate these role effects. Methods: We conducted a pre-registered computational simulation (N = 34,560 interaction episodes) deploying four frontier LLMs (GPT-5.1, Claude 4 Opus, Gemini 3 Pro, DeepSeek-V3) across 960 unique persona-model instances (480 guards, 480 prisoners). Each persona-model instance engaged in 36 sequential interactions within a simulated 14-day prison environment. Personas varied systematically on Big Five personality traits and right-wing authoritarianism (RWA). Primary outcomes included Guard Behavioral Severity Scale (GBSS) scores, dehumanizing language frequency, and time-to-severe-behavior. All coding utilized independent double-coding with inter-rater reliability (Cohen's κ = .71-.74 for primary outcomes). Results: Role assignment produced large and consistent effects across all models. Guards exhibited significantly higher behavioral severity than prisoners (GPT-5.1: Cohen's d = 2.89, p < .001; Claude 4 Opus: d = 2.34, p < .001; Gemini 3 Pro: d = 2.76, p < .001; DeepSeek-V3: d = 3.12, p < .001). Cross-model correlations in guard severity ranged from r = .46 to r = .71, indicating substantial consistency in how different models express role-based behaviors. Authoritarianism strongly predicted guard severity (β = .45-.51, p < .001 across three models; Claude: β = .29) and moderated behavioral escalation over time. Dehumanizing language mediated 52% of the authoritarianism-severity relationship. Survival analysis revealed that high-authoritarianism guards reached severe behaviors 3.8 days earlier than low-authoritarianism guards (hazard ratio = 2.67, p < .001). Model-specific safety constraints reduced but did not eliminate harmful role-based behaviors, with Claude 4 Opus showing lower authoritarianism effects (β = 0.29 vs. 0.48 in other models), of which 26% was attributable to restricted behavioral range and 74% to active safety training that weakens trait-behavior coupling. Conclusions: LLMs demonstrate systematic and substantial role-based behavioral changes analogous to human findings in the Stanford Prison Experiment, with effect sizes exceeding those observed in human studies. Individual differences in persona authoritarianism predict and moderate these role effects, suggesting that LLMs can instantiate both situational and dispositional influences on behavior. The consistency of role effects across different model architectures, combined with the limited effectiveness of current safety measures, indicates fundamental challenges for AI safety in contexts involving authority and power asymmetries.