Denoising Neural Models with Spectral QUEnching and Eigenvalue ZEroing (SQUEEZE)


Abstract

\Glspl{llm} built on the Transformer architecture perform very well but require substantial computational power. This paper presents \gls{squeeze}, a new framework that treats model compression as a signal-to-noise separation task. Using principles from \gls{rmt}, \gls{squeeze} identifies and preserves structural signal within weight matrices while discarding components consistent with random noise. Unlike traditional methods, this approach is a post-training transformation that requires no retraining of the model. Our analysis of fine-tuned \texttt{BERT-base} models reveals that matrix aspect ratio ($\beta$) strongly influences spectral behavior: rectangular feed-forward (\gls{ffn}) layers ($\beta = 0.25$) adhere closely to the \gls{mp} law and exhibit substantial redundancy, whereas square attention matrices ($\beta = 1$) and highly rectangular embedding matrices ($\beta \ll 1$) depart significantly from the null model. We implement a five-step pipeline -- standardization, \gls{svd} decomposition, baseline establishment, \gls{tw} finite-size adjustment, and rank truncation. Across three \gls{glue} tasks, \gls{squeeze} reduces model size by \SI{8.1}{\percent} with an accuracy loss of less than \SI{2.5}{\percent}. In particular, targeting \gls{ffn} layers allows a \SI{64}{\percent} reduction in effective rank while maintaining a cosine similarity above 0.80. These results show that spectral geometry is a critical factor in Transformer compressibility, positioning \gls{squeeze} as a high-fidelity alternative to aggressive methods such as quantization when model quality is the primary priority.
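The five-step pipeline named in the abstract can be sketched for a single weight matrix. This is a minimal illustration, not the paper's implementation: the function name, the `tw_margin` parameter, and the particular form of the Tracy--Widom-style finite-size adjustment (a tunable $O(N^{-2/3})$ widening of the Marchenko--Pastur bulk edge) are assumptions made for the example.

```python
import numpy as np


def squeeze_truncate(W, tw_margin=1.0):
    """Hypothetical sketch of the five SQUEEZE steps on one weight matrix.

    Steps: (1) standardization, (2) SVD, (3) Marchenko-Pastur noise
    baseline, (4) a simple Tracy-Widom-style finite-size adjustment,
    and (5) rank truncation. `tw_margin` is an illustrative knob, not
    a parameter from the paper.
    """
    n, p = W.shape
    beta = min(n, p) / max(n, p)  # aspect ratio in (0, 1]
    mu, sigma = W.mean(), W.std()

    # 1. Standardize entries to zero mean, unit variance.
    Z = (W - mu) / sigma

    # 2. SVD decomposition.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)

    # 3. MP baseline: for an iid unit-variance matrix, eigenvalues of
    #    Z^T Z / max(n, p) stay below (1 + sqrt(beta))^2, so pure-noise
    #    singular values stay below sqrt(max(n, p)) * (1 + sqrt(beta)).
    mp_edge = np.sqrt(max(n, p)) * (1.0 + np.sqrt(beta))

    # 4. TW finite-size adjustment: widen the bulk edge by an
    #    O(N^{-2/3})-scale margin so edge fluctuations are not
    #    mistaken for signal.
    edge = mp_edge * (1.0 + tw_margin * max(n, p) ** (-2.0 / 3.0))

    # 5. Rank truncation: keep only the above-edge ("signal") components.
    k = max(1, int(np.sum(s > edge)))
    Z_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

    # Undo the standardization and report the retained rank.
    return Z_hat * sigma + mu, k
```

A planted-signal sanity check makes the behavior concrete: adding a strong rank-2 component to Gaussian noise should leave exactly two singular values above the adjusted edge, so the truncation recovers rank 2.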
