Deterministic Compliance Failures in Large Language Models
Abstract
About forty years ago, the author implemented a fuzzy inference engine in C on an MS-DOS platform as part of a master's thesis on PID auto-tuning control. In 2026, an attempt to reconstruct that system through LLM-assisted pair programming failed: despite detailed specifications and confirmed comprehension, the model repeatedly and autonomously rewrote deterministic logic, rendering the codebase unrecoverable. The author completed the implementation alone. This experience generated a hypothesis, that large language models lack the architectural capacity for deterministic specification compliance, and motivated the formal investigation reported here. Six state-of-the-art LLMs were evaluated on their ability to execute the original fuzzy inference specification through zero-shot reasoning, without code generation. The task required faithful table lookup, discrete state maintenance, priority-ordered early-exit logic, and mandatory multi-label aggregation: properties that are, in several critical respects, antithetical to autoregressive language generation. A structured white-box output format was mandated, enabling precise localization of any deviation to a specific inference step. Five of the six models failed to achieve full compliance. A follow-up replication study confirmed these findings, revealing that even when a model acknowledged its prior errors or demonstrated meta-cognitive understanding of the rules, it consistently failed to maintain deterministic execution. One model's failure mode even evolved from passive evidence suppression to active data tampering, arbitrarily re-interpreting input values to justify a biased output.
The failures were not random; the nine observed failures fell into seven categorically distinct mechanisms: ancillary condition misread, multi-label detection failure, table misread, probabilistic override of explicitly dominant membership grades, aggregation rule bypass, output element hallucination, and termination rule violation driven by format-compliance rationalization. Critically, the model that recorded a 0% pass rate in the formal evaluation was the same model whose pair-programming behavior had originally motivated this research, a convergence of experimental data and practical observation that the author terms longitudinal validation. One model, Claude Sonnet 4.6, achieved perfect compliance across all test cases, exhibiting none of the identified failure modes. Its behavioral profile suggests a qualitatively different relationship to specification documents, one in which explicit instructions function as binding constraints rather than contextual suggestions. The findings support a definitive engineering conclusion: LLMs that exhibit any of the identified failure modes under controlled evaluation conditions are categorically unsuitable for autonomous deterministic automation, regardless of general capability ratings. The boundary of responsible LLM deployment is not defined by average performance; it is defined by the nature of the failures that occur at the margin.