The Dice Roll Method: A Standardized Protocol for Measuring Stochastic Bias in Large Language Model Outputs
Abstract
Researchers increasingly use repeated identical prompts to measure stochastic bias in large language model (LLM) outputs, yet no standardized protocol exists for determining adequate iteration counts, selecting appropriate stability metrics, or establishing reliability thresholds. This paper formalizes the \emph{Dice Roll Method} as a reusable audit protocol through a meta-methodology that combines reanalysis of five empirical studies (approximately 190,000 observations across three to five LLMs, more than 270 brands, six languages, and iteration counts from 5 to 40) with Monte Carlo power simulation (10,000 replications per condition). Our power analysis demonstrates that $n = 5$ iterations achieves adequate statistical power ($> 0.80$) only for large effects (Cohen's $d > 0.8$); medium effects ($d = 0.5$) require $n \geq 15$, and small effects ($d = 0.2$) remain undetectable below $n = 40$. Metric convergence follows a logarithmic curve, with 80\% of asymptotic precision achieved by $n = 7$ and 90\% by $n = 10$. Test-retest reliability reaches acceptable levels (ICC $\geq 0.70$) at $n \geq 8$ for brand count means. Cross-metric correlation analysis reveals that count-based metrics (coefficient of variation, Gini coefficient) and embedding-based metrics (cosine similarity) capture partially orthogonal information (Spearman $r = 0.31$--$0.47$), supporting the use of complementary metric batteries. We provide power analysis lookup tables, a metric selection decision tree, and a reproducible Python implementation. These findings establish minimum methodological standards for LLM auditing research and enable researchers to justify iteration counts through formal power analysis.
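As a concrete illustration of the Monte Carlo power-simulation approach the abstract describes, the sketch below estimates power as a function of iteration count $n$ and effect size $d$. It assumes a simplified design (an independent two-sample $t$-test on normally distributed outcomes); the paper's actual simulation design, metrics, and resulting lookup tables may differ, so exact power values from this sketch need not match the reported thresholds.

```python
import numpy as np
from scipy import stats

def simulated_power(n, d, reps=10_000, alpha=0.05, seed=0):
    """Monte Carlo estimate of statistical power for detecting a true
    effect of size Cohen's d with n iterations per condition, using an
    independent two-sample t-test (a simplifying assumption, not
    necessarily the paper's exact design)."""
    rng = np.random.default_rng(seed)
    # Draw all replications at once: each row is one simulated experiment.
    baseline = rng.normal(0.0, 1.0, size=(reps, n))
    shifted = rng.normal(d, 1.0, size=(reps, n))
    # Test each replication along axis=1 and count significant results.
    _, p = stats.ttest_ind(baseline, shifted, axis=1)
    return float(np.mean(p < alpha))
```

Sweeping `n` over a grid for each `d` of interest reproduces the kind of power lookup table the abstract mentions, with the caveat that the numbers depend on the assumed test and data-generating model.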