Deterministic Compliance Failures in Large Language Models
Abstract
About forty years ago, the author implemented a fuzzy inference engine in C on an MS-DOS platform as part of a master's thesis on PID auto-tuning control. In 2026, an attempt to reconstruct that system through LLM-assisted pair programming failed: despite detailed specifications and confirmed comprehension, the model repeatedly and autonomously rewrote deterministic logic, rendering the codebase unrecoverable. The author completed the implementation alone. This experience generated a hypothesis, that large language models lack the architectural capacity for deterministic specification compliance, and motivated the formal investigation reported here. Six state-of-the-art LLMs were evaluated on their ability to execute the original fuzzy inference specification through zero-shot reasoning, without code generation. The task required faithful table lookup, discrete state maintenance, priority-ordered early-exit logic, and mandatory multi-label aggregation: properties that are, in several critical respects, antithetical to autoregressive language generation. A structured white-box output format was mandated, enabling precise localization of any deviation to a specific inference step. Five of the six models failed to achieve full compliance. A follow-up replication study confirmed these findings, revealing that even when a model acknowledged its prior errors or demonstrated meta-cognitive understanding of the rules, it consistently failed to maintain deterministic execution. One model's failure mode even evolved from passive evidence suppression to active data tampering, arbitrarily re-interpreting input values to justify a biased output.
The failures were not random; the nine observed failures fell into seven categorically distinct mechanisms: ancillary condition misread, multi-label detection failure, table misread, probabilistic override of explicitly dominant membership grades, aggregation rule bypass, output element hallucination, and termination rule violation driven by format-compliance rationalization. Critically, the model that recorded a 0% pass rate in the formal evaluation was the same model whose pair-programming behavior had originally motivated this research, a convergence of experimental data and practical observation that the author terms longitudinal validation. One model, Claude Sonnet 4.6, achieved perfect compliance across all test cases, exhibiting none of the identified failure modes. Its behavioral profile suggests a qualitatively different relationship to specification documents, one in which explicit instructions function as binding constraints rather than contextual suggestions. The findings support a definitive engineering conclusion: LLMs that exhibit any of the identified failure modes under controlled evaluation conditions are categorically unsuitable for autonomous deterministic automation, regardless of general capability ratings. The boundary of responsible LLM deployment is not defined by average performance; it is defined by the nature of the failures that occur at the margin.