Same Prompt, Different Answer: Exposing the Reproducibility Illusion in Large Language Model APIs
Abstract
The same prompt sent twice to a large language model API under documented "deterministic" settings can return different answers, yet this variation is invisible to users. Here we report 4,104 controlled experiments across eight models and five API providers showing that, under temperature-zero greedy decoding with fixed seeds, API-served models reproduce their own outputs only 22.1% of the time, while locally deployed models achieve 95.6%, a more than four-fold gap. Non-determinism persists in multi-turn and retrieval-augmented generation workflows, where one model produces zero exact matches across 50 runs, yet it remains hidden because outputs are semantically equivalent (BERTScore F1 > 0.97). A quasi-isolation experiment identifies production infrastructure complexity, rather than cloud deployment itself, as the driver. We provide a lightweight provenance protocol (<1% overhead) that makes this variation detectable, raising a reliability concern for the growing use of LLMs in medicine, the physical sciences, and automated data analysis.
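The exact-match reproducibility rate described above can be sketched as follows. This is a minimal illustration, not the paper's actual harness: `query_model` is a hypothetical stand-in for a provider API call made with nominally deterministic settings (temperature 0, fixed seed), and the simulated drift pattern is invented for demonstration.

```python
from collections import Counter

def exact_match_rate(outputs):
    """Fraction of runs that reproduce the modal (most common) output."""
    if not outputs:
        return 0.0
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / len(outputs)

def query_model(prompt, run_id):
    # Hypothetical stand-in for an API call; a real harness would invoke
    # a provider SDK with temperature=0 and a fixed seed. Here we simulate
    # infrastructure-induced variation: every fourth run drifts.
    return "answer-A" if run_id % 4 else "answer-B"

# Send the same prompt 50 times and measure how often the modal answer recurs.
runs = [query_model("Same prompt", i) for i in range(50)]
print(f"exact-match rate: {exact_match_rate(runs):.1%}")
```

A real measurement would additionally log request metadata (model version, seed, sampling parameters) per run, which is the kind of information a provenance protocol makes verifiable.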