Denoising Neural Models with Spectral QUEnching and Eigenvalue ZEroing (SQUEEZE)
Abstract
\Glspl{llm} built on the Transformer architecture perform very well but require substantial computational power. This paper presents \gls{squeeze}, a new framework that treats model compression as a signal-to-noise separation task. Using principles from \gls{rmt}, \gls{squeeze} identifies and preserves structural signals within weight matrices while discarding components that align with random noise. Unlike traditional methods, this approach serves as a post-training transformation that does not require retraining the model. Our analysis of fine-tuned \texttt{BERT-base} models reveals that matrix aspect ratios ($\beta$) significantly influence spectral behavior: rectangular feed-forward (\gls{ffn}) layers ($\beta =$ 0.25) adhere closely to the \gls{mp} law and exhibit substantial redundancy, whereas square attention matrices ($\beta =$ 1) and highly rectangular embedding matrices ($\beta \ll$ 1) show significant departures from the null model. We implement a five-step pipeline -- standardization, \gls{svd} decomposition, baseline establishment, \gls{tw} finite-size adjustment, and rank truncation. Testing across three \gls{glue} tasks shows that \gls{squeeze} reduces model size by \SI{8.1}{\percent} with an accuracy loss of less than \SI{2.5}{\percent}. Specifically, targeting \gls{ffn} layers allows for a \SI{64}{\percent} reduction in effective rank while maintaining a cosine similarity of over 0.80. Our results show that spectral geometry is a critical factor in Transformer compressibility, positioning \gls{squeeze} as a high-fidelity alternative to aggressive methods such as quantization when model quality is the primary priority.
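To make the five-step pipeline concrete, the following is a minimal sketch of \gls{mp}-edge rank truncation in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name \texttt{squeeze\_truncate}, the global standardization, and the multiplicative \texttt{tw\_slack} form of the \gls{tw} finite-size adjustment are all choices made here for clarity.

```python
import numpy as np

def squeeze_truncate(W, tw_slack=0.0):
    """Hedged sketch of MP-based rank truncation (illustrative, not the paper's code).

    Mirrors the abstract's pipeline: standardization, SVD, Marchenko-Pastur
    baseline, Tracy-Widom finite-size slack, and rank truncation.
    """
    n, m = W.shape
    if n < m:  # work with n >= m so that beta = m / n <= 1
        W_hat_T, k = squeeze_truncate(W.T, tw_slack)
        return W_hat_T.T, k
    beta = m / n
    mu, sigma = W.mean(), W.std()
    Ws = (W - mu) / sigma                       # step 1: standardization
    U, s, Vt = np.linalg.svd(Ws, full_matrices=False)  # step 2: SVD
    # step 3: MP bulk edge for singular values of an n x m iid
    # unit-variance matrix: sqrt(n) * (1 + sqrt(beta))
    edge = np.sqrt(n) * (1.0 + np.sqrt(beta))
    # step 4: finite-size adjustment (TW-motivated slack; assumed form)
    edge *= (1.0 + tw_slack)
    # step 5: keep only singular values above the noise baseline
    k = max(int(np.sum(s > edge)), 1)
    W_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return W_hat * sigma + mu, k                # undo standardization
```

On a synthetic rank-one signal buried in iid noise, the retained rank collapses to one, which is the behavior the \gls{mp} baseline is meant to deliver; on real \gls{ffn} weights the abstract reports an effective-rank reduction of about \SI{64}{\percent} rather than a full collapse.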