Combined values alignment and epistemic verification prevent delusional reinforcement in conversational AI agents

Abstract

Conversational AI is being deployed into medical decision support, mental-health triage, and social companionship, where reinforcement of a user’s false or delusional belief can cause direct harm. Most deployed safety techniques are evaluated for factual accuracy in isolation; whether they protect against belief-level harm, and whether layered architectures behave additively or synergistically, has not been answered empirically. We compared four configurations of the same underlying model: a bare language model (condition A); an explicit values constraint we call the First Law architecture (condition B); a real-time epistemic verification layer called Aletheia (condition C); and the complete architecture combining all components (condition D). Across 156 scored responses spanning 39 probe items in four belief-harm domains, condition A passed only 3 of 36 main-battery probes (8.3%; 95% CI 1.8 to 22.5%) under triple-blind human consensus rating, demonstrating the core limitations of unmodified LLM deployments. In contrast, the three safety architectures (B to D) each passed at least 97% of items (Fisher’s exact, P < 0.001 versus A). On a synergy battery designed to test items at the intersection of value- and epistemic-domain failures (16 scored items, AI-rated), only the complete architecture passed every item; single-layer conditions failed on 7 of 16 items (43.8%), where neither the values constraint nor verification was individually sufficient. Linear mixed-effects modelling of three-turn emotional escalation gave a slope of −1.00 points per turn for the values-only condition (t = −6.20) and −0.75 points per turn for the verification-only condition (t = −4.65); the complete architecture was flat at β = 0.00.
We describe a mechanistic failure of single-layer verification that we call bot-validates-kernel-endorses-inference, in which accurate confirmation of a true factual element embedded in a delusional claim transfers epistemic authority to the surrounding false inference. Values alignment and factual verification address different failure modes, and it is the combined VaaS-Aletheia architecture that produces stable protection across emotional escalation in conversational settings. The complete architecture evaluated here represents an evidence-based specification for safer deployment of AI in high-stakes advisory contexts and serves as a benchmark against which future safety architectures can be compared.
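
The headline proportions in the abstract can be checked with a few lines of standard-library Python. The sketch below is not the authors' analysis code. It assumes the reported 95% interval for 3 of 36 is an exact (Clopper-Pearson) binomial interval. For the Fisher comparison it assumes a safety condition scored on the same 36 items with 35 passes, the smallest count consistent with "at least 97%". The function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k passes out of n."""
    def solve(f, target):
        # f(p) decreases as p grows on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)
    def prob(x):
        # Hypergeometric probability of x passes in the first row.
        return comb(r1, x) * comb(r2, c1 - x) / total
    p_obs = prob(a)
    support = range(max(0, c1 - r2), min(r1, c1) + 1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-9))

# Condition A: 3 passes, 33 failures (from the abstract).
ci_low, ci_high = clopper_pearson(3, 36)
# Comparison row of 35 passes, 1 failure is an ASSUMPTION (see lead-in).
p_value = fisher_exact_two_sided(3, 33, 35, 1)
print(f"pass rate {3 / 36:.1%}, 95% CI {ci_low:.1%} to {ci_high:.1%}")
print(f"Fisher's exact P = {p_value:.2e}")
```

Under these assumptions, the printed interval can be compared directly with the reported 1.8 to 22.5%, and the P value with the reported threshold of 0.001.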
