A Scalable Four-Level Functional Hierarchy for Evaluating Large Language Models: Hallucination, Self-Monitoring, and the Hypothesized Structural Advantages of Arabic and Chinese

Abstract

The rapid advancement of Large Language Models (LLMs) has exposed persistent failure modes (hallucination, alignment brittleness, and energy inefficiency) that scale alone has not resolved. We hypothesize that these failures are not incidental bugs but systematic symptoms of a deeper architectural gap: the treatment of language processing as a flat, undifferentiated process rather than as a hierarchy of functionally distinct levels. Drawing on the established cognitive-neuroscience and functional-dissociation literature, we propose the Four-Level Functional Hierarchy (FLFH), comprising Automated Encoding (L1), Symbolic Synthesis (L2), Volitional Self-Monitoring (L3), and Ethical-Teleological Integration (L4), as a diagnostic and evaluative framework for LLM architectures. We extend a five-criterion Structural Transition Test (STT) with a sixth criterion, Recursive Self-Improvement Capacity, to address scalability in the digital economy. Under the FLFH interpretation, a systematic analysis suggests that current leading models (GPT-4, Claude, Gemini, DeepSeek) operate predominantly within the L1/L2 functional regimes, lacking the L3/L4 integration that the framework posits as necessary for genuine self-monitoring and ethical commitment. We further hypothesize that morphologically rich languages, specifically Arabic and Chinese, may provide structural scaffolding that reduces hallucination rates and improves computational efficiency. Preliminary evidence from AraHalluEval suggests that Arabic-specific models produce significantly fewer hallucinations (p < 0.01), and MorphBPE tokenization improves morphological-consistency F1 from 0.00 to 0.66. We interpret these findings as consistent with, though not conclusive proof of, the FLFH framework, and we provide falsifiability conditions and a validation roadmap to guide experimental follow-up.
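
As a reading aid, the following is a minimal Python sketch (not from the paper) of how the FLFH levels and an FLFH-style assessment might be encoded as an evaluation rubric. The four level names are taken from the abstract; the scoring scheme, threshold, and the contiguity rule are our own simplifying assumptions, and the STT criterion labels other than Recursive Self-Improvement Capacity are hypothetical placeholders, since the abstract does not enumerate the original five.

```python
from dataclasses import dataclass
from enum import IntEnum


class FLFHLevel(IntEnum):
    # Level names as given in the abstract.
    L1_AUTOMATED_ENCODING = 1
    L2_SYMBOLIC_SYNTHESIS = 2
    L3_VOLITIONAL_SELF_MONITORING = 3
    L4_ETHICAL_TELEOLOGICAL_INTEGRATION = 4


# The abstract names only the sixth STT criterion; the first five
# labels below are hypothetical placeholders, not the paper's terms.
STT_CRITERIA = (
    "criterion_1_placeholder",
    "criterion_2_placeholder",
    "criterion_3_placeholder",
    "criterion_4_placeholder",
    "criterion_5_placeholder",
    "recursive_self_improvement_capacity",
)


@dataclass
class FLFHAssessment:
    """Per-level capability scores in [0, 1] for one model."""
    model_name: str
    level_scores: dict[FLFHLevel, float]

    def highest_attained_level(self, threshold: float = 0.5) -> FLFHLevel | None:
        """Highest contiguous level whose score clears the threshold.

        Contiguity encodes the hierarchy assumption: a model cannot
        attain L3 without first attaining L1 and L2.
        """
        attained = None
        for level in FLFHLevel:
            if self.level_scores.get(level, 0.0) >= threshold:
                attained = level
            else:
                break
        return attained


# Illustrative scores only, not measurements reported in the paper.
example = FLFHAssessment(
    model_name="generic-llm",
    level_scores={
        FLFHLevel.L1_AUTOMATED_ENCODING: 0.95,
        FLFHLevel.L2_SYMBOLIC_SYNTHESIS: 0.80,
        FLFHLevel.L3_VOLITIONAL_SELF_MONITORING: 0.20,
        FLFHLevel.L4_ETHICAL_TELEOLOGICAL_INTEGRATION: 0.05,
    },
)
print(example.highest_attained_level())  # FLFHLevel.L2_SYMBOLIC_SYNTHESIS
```

Under this toy rubric, the example model plateaus at L2, which mirrors the abstract's claim that current leading models operate predominantly within L1/L2 functional regimes.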
