Spectral-Profile-Aware Low-Rank Compression for GPU Memory and Bandwidth Optimization
Abstract
GPU workloads in high-performance computing (HPC) and machine-learning inference are often limited by memory capacity and memory bandwidth rather than floating-point throughput. Low-rank factorization is a common strategy for reducing storage and memory traffic, yet it can also fail (sometimes even increasing the memory footprint) when the rank required at a target error tolerance is too large. This paper makes the success/failure boundary explicit at the level of the singular spectrum. Using the Eckart–Young–Mirsky optimality identity and a minimal memory-traffic model, we relate the required rank k(ε) (for relative Frobenius tolerance ε) to the tail class of the singular values. We derive closed-form scalings for canonical tails and obtain a practical, vendor-agnostic decision rule: estimate the spectral tail, predict k(ε), and compress only when the predicted representation is memory-positive. A fully reproducible benchmark (Python script + CSV outputs) and a case study at N = 4096 illustrate the main point: at ε = 0.1, an exponential spectrum requires k = 16 and yields a ∼1.3 × 10² storage reduction, whereas a borderline heavy tail requires k ≈ 3748 and yields no reduction. We also show how the achievable reduction scales with N at fixed tolerance.
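The decision rule summarized above can be sketched in a few lines of Python. The sketch below is illustrative, not the paper's benchmark code: the exponential decay rate used for the example spectrum is a hypothetical choice, and the storage model counts only the raw entries of the rank-k factors versus the dense matrix.

```python
import numpy as np


def required_rank(s: np.ndarray, eps: float) -> int:
    """Smallest k with ||A - A_k||_F / ||A||_F <= eps.

    By the Eckart-Young-Mirsky identity, the squared Frobenius error of the
    best rank-k approximation equals the sum of the trailing squared
    singular values, so k(eps) is read off directly from the spectrum.
    """
    s2 = np.sort(np.asarray(s, dtype=float))[::-1] ** 2  # descending sigma_i^2
    total = s2.sum()
    tail = total - np.cumsum(s2)  # tail[j] = sum of s2[j+1:]
    return int(np.argmax(tail <= (eps ** 2) * total)) + 1


def memory_positive(m: int, n: int, k: int) -> bool:
    """Compression pays off only if the rank-k factors U (m x k) and
    V (k x n) need fewer entries than the dense matrix: k(m + n) < m*n."""
    return k * (m + n) < m * n


# Hypothetical exponentially decaying spectrum at N = 4096
# (decay rate 0.15 chosen for illustration).
N, eps = 4096, 0.1
sigma = np.exp(-0.15 * np.arange(N))
k = required_rank(sigma, eps)
print(k, memory_positive(N, N, k))
```

For a square N x N matrix the break-even rank is N/2, which is why a borderline heavy tail with k ≈ 3748 at N = 4096 yields no reduction even though k < N.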