Same Prompt, Different Answer: Exposing the Reproducibility Illusion in Large Language Model APIs


Abstract

The same prompt sent twice to a large language model API under documented "deterministic" settings can return different answers, yet this variation is invisible to users. Here we report 4,104 controlled experiments across eight models and five API providers showing that, under temperature-zero greedy decoding with fixed seeds, API-served models reproduce their own outputs only 22.1% of the time, while locally deployed models achieve 95.6%, a more than four-fold gap. Non-determinism persists in multi-turn and retrieval-augmented generation workflows, where one model produces zero exact matches across 50 runs, yet it remains hidden because the outputs are semantically equivalent (BERTScore F1 > 0.97). A quasi-isolation experiment identifies production infrastructure complexity, rather than cloud deployment itself, as the driver. We provide a lightweight provenance protocol (<1% overhead) that makes this variation detectable, raising a reliability concern for the growing use of LLMs in medicine, physical sciences, and automated data analysis.
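The repeatability measurement behind the 22.1% and 95.6% figures can be sketched as an exact-match rate over repeated runs of the same prompt. The helper below is illustrative only, not the paper's actual methodology: `query_model` is a hypothetical stand-in for any API call made with temperature 0 and a fixed seed, and the rate is computed against the most frequent output across runs.

```python
from collections import Counter
from typing import Callable, List


def exact_match_rate(outputs: List[str]) -> float:
    """Fraction of runs whose output string exactly matches the
    most common output observed across all runs."""
    if not outputs:
        return 0.0
    # Count how often the modal (most frequent) output appears.
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)


def measure_repeatability(query_model: Callable[[str], str],
                          prompt: str,
                          n_runs: int = 50) -> float:
    """Send the same prompt n_runs times and report the exact-match rate.

    `query_model` is a hypothetical callable wrapping an LLM API call
    with temperature=0 and a fixed seed; it is not a real library API.
    """
    outputs = [query_model(prompt) for _ in range(n_runs)]
    return exact_match_rate(outputs)


# Example with canned outputs standing in for API responses:
# three of four identical runs give an exact-match rate of 0.75.
rate = exact_match_rate(["42", "42", "41", "42"])
```

A fully deterministic deployment would score 1.0 here; the abstract's finding is that API-served models fall far short of that even when every documented determinism control is set.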
