A Scalable Four-Level Functional Hierarchy for Evaluating Large Language Models: Hallucination, Self-Monitoring, and the Hypothesized Structural Advantages of Arabic and Chinese

Abstract

The rapid advancement of Large Language Models (LLMs) has exposed persistent failure modes (hallucination, alignment brittleness, and energy inefficiency) that scale alone has not resolved. We hypothesize that these failures are not incidental bugs but systematic symptoms of a deeper architectural gap: the treatment of language processing as a flat, undifferentiated process rather than as a hierarchy of functionally distinct levels. Drawing on the established cognitive-neuroscience and functional-dissociation literature, we propose the Four-Level Functional Hierarchy (FLFH), comprising Automated Encoding (L1), Symbolic Synthesis (L2), Volitional Self-Monitoring (L3), and Ethical-Teleological Integration (L4), as a diagnostic and evaluative framework for LLM architectures. We extend a five-criterion Structural Transition Test (STT) with a sixth criterion, Recursive Self-Improvement Capacity, to address scalability in the digital economy. Under the FLFH interpretation, a systematic analysis suggests that current leading models (GPT-4, Claude, Gemini, DeepSeek) operate predominantly within the L1/L2 functional regimes, lacking the L3/L4 integration that the framework posits as necessary for genuine self-monitoring and ethical commitment. We further hypothesize that morphologically rich languages, specifically Arabic and Chinese, may provide structural scaffolding that reduces hallucination rates and improves computational efficiency. Preliminary evidence from AraHalluEval suggests that Arabic-specific models produce significantly fewer hallucinations (p < 0.01), and MorphBPE tokenization improves morphological-consistency F1 from 0.00 to 0.66. We interpret these findings as consistent with, though not conclusive proof of, the FLFH framework, and we provide falsifiability conditions and a validation roadmap to guide experimental follow-up.
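
As a reading aid, the following is a minimal Python sketch (not from the paper) of how the FLFH levels and an FLFH-style assessment might be encoded as an evaluation rubric. The four level names are taken from the abstract; the scoring scheme, threshold, and the contiguity rule are our own simplifying assumptions, and the STT criterion labels other than Recursive Self-Improvement Capacity are hypothetical placeholders, since the abstract does not enumerate the original five.

```python
from dataclasses import dataclass
from enum import IntEnum


class FLFHLevel(IntEnum):
    # Level names as given in the abstract.
    L1_AUTOMATED_ENCODING = 1
    L2_SYMBOLIC_SYNTHESIS = 2
    L3_VOLITIONAL_SELF_MONITORING = 3
    L4_ETHICAL_TELEOLOGICAL_INTEGRATION = 4


# The abstract names only the sixth STT criterion; the first five
# labels below are hypothetical placeholders, not the paper's terms.
STT_CRITERIA = (
    "criterion_1_placeholder",
    "criterion_2_placeholder",
    "criterion_3_placeholder",
    "criterion_4_placeholder",
    "criterion_5_placeholder",
    "recursive_self_improvement_capacity",
)


@dataclass
class FLFHAssessment:
    """Per-level capability scores in [0, 1] for one model."""
    model_name: str
    level_scores: dict[FLFHLevel, float]

    def highest_attained_level(self, threshold: float = 0.5) -> FLFHLevel | None:
        """Highest contiguous level whose score clears the threshold.

        Contiguity encodes the hierarchy assumption: a model cannot
        attain L3 without first attaining L1 and L2.
        """
        attained = None
        for level in FLFHLevel:
            if self.level_scores.get(level, 0.0) >= threshold:
                attained = level
            else:
                break
        return attained


# Illustrative scores only, not measurements reported in the paper.
example = FLFHAssessment(
    model_name="generic-llm",
    level_scores={
        FLFHLevel.L1_AUTOMATED_ENCODING: 0.95,
        FLFHLevel.L2_SYMBOLIC_SYNTHESIS: 0.80,
        FLFHLevel.L3_VOLITIONAL_SELF_MONITORING: 0.20,
        FLFHLevel.L4_ETHICAL_TELEOLOGICAL_INTEGRATION: 0.05,
    },
)
print(example.highest_attained_level())  # FLFHLevel.L2_SYMBOLIC_SYNTHESIS
```

Under this toy rubric, the example model plateaus at L2, which mirrors the abstract's claim that current leading models operate predominantly within L1/L2 functional regimes.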
