Emergent Moral Representations in Large Language Models Align with Human Conceptual, Neural, and Behavioral Moral Structure
Abstract
Large language models (LLMs) increasingly operate in ethically sensitive settings, yet it remains unclear whether they internally encode structured representations of morality. Here we examine the activation-space representations of several mid-sized LLMs to test whether statistical learning over text gives rise to moral distinctions that parallel human conceptual, behavioral, and neural organization. Using multivariate decoding, representational similarity analysis, partial least squares correlation, and behavioral prediction, we show that moral foundations are linearly decodable from hidden activations, with peak discriminability in middle layers. Representational similarity analysis uncovered a hierarchical moral geometry consistent with Moral Foundations Theory; in the human brain, the posterior cingulate cortex (PCC) showed robust multivariate decoding and a representational structure that aligned most strongly with mid-layer LLM activations. These same middle layers also predicted human wrongness judgments, indicating a shared computational substrate for moral evaluation. Partial least squares correlation further revealed orthogonal activation dimensions corresponding to individual foundations and their higher-order abstractions, yielding interpretable axes along which moral meaning is encoded. Together, these results reveal a striking convergence across conceptual, behavioral, neural, and model representations, positioning LLMs as emerging neurocognitive models of moral reasoning and offering a window into the internal mechanisms that shape their behavior in sensitive domains. Such alignment may enable more explainable and transparent AI systems and support future efforts to ground LLMs in human values.
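To make the two core measurements named in the abstract concrete, the minimal Python sketch below illustrates layer-wise linear decoding of moral-foundation labels from hidden activations and a representational similarity comparison against a reference dissimilarity matrix. This is not the authors' code: the synthetic arrays, dimensions, and variable names (activations, labels, reference_rdm) are placeholders standing in for real LLM activations and a neural (e.g., PCC-derived) representational dissimilarity matrix.

# Illustrative sketch (not the authors' pipeline): per-layer linear decoding of
# moral-foundation labels plus an RSA comparison with a reference RDM.
# Synthetic random data stand in for LLM activations and neural dissimilarities.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_items, n_layers, hidden_dim, n_foundations = 120, 12, 256, 5

labels = rng.integers(0, n_foundations, size=n_items)               # moral-foundation label per scenario
activations = rng.standard_normal((n_layers, n_items, hidden_dim))  # [layer, item, hidden unit]
reference_rdm = squareform(pdist(rng.standard_normal((n_items, 8))))  # stand-in for a neural RDM

for layer in range(n_layers):
    X = activations[layer]

    # 1) Multivariate decoding: cross-validated accuracy of a linear classifier
    #    separating the five foundations from this layer's activations.
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()

    # 2) Representational similarity analysis: correlate the layer's item-by-item
    #    dissimilarity structure with the reference dissimilarity matrix.
    layer_rdm = squareform(pdist(X, metric="correlation"))
    iu = np.triu_indices(n_items, k=1)
    rsa_rho, _ = spearmanr(layer_rdm[iu], reference_rdm[iu])

    print(f"layer {layer:2d}  decoding acc = {acc:.2f}  RSA rho = {rsa_rho:+.2f}")

In an actual analysis of this kind, the activations would come from forward passes of the model over moral scenarios and the reference RDM from human conceptual or neuroimaging data; the pattern the abstract reports is that both decoding accuracy and representational alignment peak in the middle layers.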