The Dice Roll Method: A Standardized Protocol for Measuring Stochastic Bias in Large Language Model Outputs
Abstract
Researchers increasingly use repeated identical prompts to measure stochastic bias in large language model (LLM) outputs, yet no standardized protocol exists for determining adequate iteration counts, selecting appropriate stability metrics, or establishing reliability thresholds. This paper formalizes the \emph{Dice Roll Method} as a reusable audit protocol through a meta-methodology that combines reanalysis of five empirical studies (approximately 190,000 observations across three to five LLMs, more than 270 brands, six languages, and iteration counts from 5 to 40) with Monte Carlo power simulation (10,000 replications per condition). Our power analysis demonstrates that $n = 5$ iterations achieves adequate statistical power ($> 0.80$) only for large effects (Cohen's $d > 0.8$); medium effects ($d = 0.5$) require $n \geq 15$, and small effects ($d = 0.2$) remain undetectable below $n = 40$. Metric convergence follows a logarithmic curve, with 80\% of asymptotic precision achieved by $n = 7$ and 90\% by $n = 10$. Test-retest reliability reaches acceptable levels (ICC $\geq 0.70$) at $n \geq 8$ for brand count means. Cross-metric correlation analysis reveals that count-based metrics (coefficient of variation, Gini coefficient) and embedding-based metrics (cosine similarity) capture partially orthogonal information (Spearman $r = 0.31$--$0.47$), supporting the use of complementary metric batteries. We provide power analysis lookup tables, a metric selection decision tree, and a reproducible Python implementation. These findings establish minimum methodological standards for LLM auditing research and enable researchers to justify iteration counts through formal power analysis.
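As a concrete illustration of the Monte Carlo power-simulation approach the abstract describes, the sketch below estimates power as a function of iteration count $n$ and effect size $d$. It assumes a simplified design (an independent two-sample $t$-test on normally distributed outcomes); the paper's actual simulation design, metrics, and resulting lookup tables may differ, so exact power values from this sketch need not match the reported thresholds.

```python
import numpy as np
from scipy import stats

def simulated_power(n, d, reps=10_000, alpha=0.05, seed=0):
    """Monte Carlo estimate of statistical power for detecting a true
    effect of size Cohen's d with n iterations per condition, using an
    independent two-sample t-test (a simplifying assumption, not
    necessarily the paper's exact design)."""
    rng = np.random.default_rng(seed)
    # Draw all replications at once: each row is one simulated experiment.
    baseline = rng.normal(0.0, 1.0, size=(reps, n))
    shifted = rng.normal(d, 1.0, size=(reps, n))
    # Test each replication along axis=1 and count significant results.
    _, p = stats.ttest_ind(baseline, shifted, axis=1)
    return float(np.mean(p < alpha))
```

Sweeping `n` over a grid for each `d` of interest reproduces the kind of power lookup table the abstract mentions, with the caveat that the numbers depend on the assumed test and data-generating model.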