Decoding Transformers Spectra: A Random Matrix Theory Framework Beyond the Marchenko–Pastur Law
Abstract
The rapid expansion of large language models (LLMs) has intensified the demand for principled methodologies capable of decoding their internal structure and guiding efficient deployment. Although Transformers achieve state-of-the-art performance, the large linear operators that compose their architectures - such as attention projections, feed-forward layers, and embeddings - are represented by weight matrices that often contain substantial redundancy and noise. To address this, we develop a Random Matrix Theory (RMT) framework that systematically analyzes the spectral behavior of Transformer weight matrices beyond the classical Marchenko–Pastur law. The framework integrates Marchenko–Pastur baselines, bootstrap calibration, and shrinkage transformations to disentangle noise from structured signal in high-dimensional spectra. The objective of this study is to characterize the bulk-plus-spike organization, edge fluctuations, and finite-sample deviations observed in Transformer spectra, thereby establishing a rigorous methodology to guide spectral denoising, shrinkage, and compression strategies. Our empirical analysis reveals that feed-forward layers conform more closely to Marchenko–Pastur predictions, while attention and embedding layers display pronounced edge deviations consistent with Tracy–Widom statistics. These findings yield a taxonomy of layer-specific spectral behavior, linking empirical spectra to theoretical distributions and highlighting distinctive roles across components. Overall, this work positions RMT-based spectral decoding as both a rigorous and practical tool for analyzing modern deep learning models, providing methodological insights into robustness, generalization, and compressibility in Transformer architectures.
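To make the Marchenko–Pastur baseline step concrete, the following minimal sketch compares the empirical eigenvalue spectrum of a weight matrix against the Marchenko–Pastur density and counts eigenvalues beyond the bulk edge as candidate spikes. It is not the authors' implementation: the matrix shape, the unit-variance Gaussian stand-in for a real Transformer weight matrix, and the binning are illustrative assumptions, and the bootstrap calibration and shrinkage steps described in the abstract are not shown.

```python
import numpy as np


def marchenko_pastur_pdf(lam, q, sigma2=1.0):
    """Marchenko-Pastur density for eigenvalues of (1/N) W^T W with q = M/N <= 1."""
    lam_minus = sigma2 * (1.0 - np.sqrt(q)) ** 2
    lam_plus = sigma2 * (1.0 + np.sqrt(q)) ** 2
    pdf = np.zeros_like(lam, dtype=float)
    inside = (lam > lam_minus) & (lam < lam_plus)
    pdf[inside] = np.sqrt((lam_plus - lam[inside]) * (lam[inside] - lam_minus)) / (
        2.0 * np.pi * sigma2 * q * lam[inside]
    )
    return pdf


# Stand-in for a Transformer weight matrix of shape (N, M); in practice this
# would be an attention projection, feed-forward, or embedding matrix.
rng = np.random.default_rng(0)
N, M = 4096, 1024                                  # N >= M so that q = M/N <= 1
W = rng.normal(0.0, 1.0, size=(N, M))

# Empirical spectrum of the normalized Gram matrix (1/N) W^T W.
eigvals = np.linalg.eigvalsh(W.T @ W / N)
q = M / N
lam_plus = (1.0 + np.sqrt(q)) ** 2                 # MP bulk edge (sigma^2 = 1)

# Eigenvalues beyond the bulk edge are candidate "spikes" (structured signal).
n_outliers = int(np.sum(eigvals > lam_plus))

# Coarse comparison of the empirical bulk against the MP density.
edges = np.linspace(0.0, lam_plus, 51)
hist, _ = np.histogram(eigvals, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_dev = np.max(np.abs(hist - marchenko_pastur_pdf(centers, q)))

print(f"q = {q:.3f}, MP edge = {lam_plus:.3f}, "
      f"outliers beyond edge = {n_outliers}, max bulk deviation = {max_dev:.3f}")
```

For a pure-noise matrix the outlier count should be near zero and the bulk histogram should track the Marchenko–Pastur curve; applied to trained weights, eigenvalues escaping the bulk edge are the structured signal that the abstract's layer-specific taxonomy is built on.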