Same Prompt, Different Answer: Exposing the Reproducibility Illusion in Large Language Model APIs
Abstract
The same prompt sent twice to a large language model API under documented "deterministic" settings can return different answers, yet this variation is invisible to users. Here we report 4,104 controlled experiments across eight models and five API providers showing that, under temperature-zero greedy decoding with fixed seeds, API-served models reproduce their own outputs only 22.1% of the time, while locally deployed models achieve 95.6%, a more than four-fold gap. Non-determinism persists in multi-turn and retrieval-augmented generation workflows, where one model produces zero exact matches across 50 runs, yet it remains hidden because outputs are semantically equivalent (BERTScore F1 > 0.97). A quasi-isolation experiment identifies production infrastructure complexity, rather than cloud deployment itself, as the driver. We provide a lightweight provenance protocol (<1% overhead) that makes this variation detectable, raising a reliability concern for the growing use of LLMs in medicine, the physical sciences, and automated data analysis.
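The exact-match reproducibility rate described above can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `query_model` is a hypothetical stand-in for a provider API call made with nominally deterministic settings (temperature 0, fixed seed), and the simulated drift pattern is invented for demonstration.

```python
from collections import Counter

def exact_match_rate(outputs):
    """Fraction of runs that reproduce the modal (most common) output."""
    if not outputs:
        return 0.0
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / len(outputs)

def query_model(prompt, run_id):
    # Hypothetical stand-in for an API call; a real harness would invoke
    # a provider SDK with temperature=0 and a fixed seed. Here we simulate
    # infrastructure-induced variation: every fourth run drifts.
    return "answer-A" if run_id % 4 else "answer-B"

# Send the same prompt 50 times and measure how often the modal answer recurs.
runs = [query_model("Same prompt", i) for i in range(50)]
print(f"exact-match rate: {exact_match_rate(runs):.1%}")
```

A real measurement would additionally log request metadata (model version, seed, sampling parameters) per run, which is the kind of information a provenance protocol makes verifiable.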