A Unified Perspective on Efficient Attention: Generalized Memory and Kernel Function Selection in Transformers
Abstract
This article introduces a new theoretical perspective on Scaled Dot-Product Attention (SDPA) in transformers by connecting it to distributed memory theory. We propose that SDPA is an extension of distributed memory and reframe it as a Generalized Memory Model, a mechanism for learning multi-way associations that expands upon classical frameworks. Our experimental findings validate this perspective and yield practical methods for building more efficient attention mechanisms. We demonstrate that model convergence is maintained even when the query and key vectors are identical, a modification that halves the memory required to store them and offers a significant efficiency gain for long sequences. Furthermore, our comprehensive kernel analysis shows that all tested kernels, including a simple linear kernel, provide a path to convergence, establishing a strong baseline for attention approximation. The Radial Basis Function (RBF) kernel offers marginal improvements, and we also introduce several kernels that converge successfully when their attention weights are normalized. Collectively, this work provides both a unifying theoretical lens for transformer attention and a practical guide to kernel selection for developing more robust and efficient models.
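To make the two efficiency ideas in the abstract concrete, the following is a minimal illustrative sketch (not the authors' implementation): a single-head attention module in which the query and key projections share one weight matrix, so q == k for every token, and in which the softmax weighting can be swapped for a linear or RBF kernel with row-normalized weights. The module name TiedQKAttention, the unit RBF bandwidth, and the sum-to-one normalization scheme are assumptions made for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedQKAttention(nn.Module):
    """Single-head attention with a shared query/key projection.

    Hypothetical sketch: one linear layer produces both queries and keys,
    halving the parameters and activations needed for those projections.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.qk_proj = nn.Linear(d_model, d_model)  # shared Q/K projection (q == k)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, kernel: str = "softmax") -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        qk = self.qk_proj(x)                              # shared representation for Q and K
        v = self.v_proj(x)
        scores = qk @ qk.transpose(-2, -1) * self.scale   # pairwise similarities

        if kernel == "softmax":
            # Standard SDPA weighting.
            weights = F.softmax(scores, dim=-1)
        elif kernel == "linear":
            # Linear kernel: raw dot products, normalized so each row sums to one.
            # Normalization choice is an assumption; negative scores are left as-is.
            weights = scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        elif kernel == "rbf":
            # RBF kernel on pairwise squared distances, bandwidth of 1 assumed.
            sq_dists = torch.cdist(qk, qk) ** 2
            weights = torch.exp(-sq_dists)
            weights = weights / weights.sum(dim=-1, keepdim=True)
        else:
            raise ValueError(f"unknown kernel: {kernel}")

        return weights @ v

# Usage example on a toy batch:
# attn = TiedQKAttention(d_model=64)
# out = attn(torch.randn(2, 16, 64), kernel="rbf")   # (2, 16, 64)

Because q and k come from the same projection, only one (seq_len x d_model) activation has to be cached for the similarity computation, which is the source of the memory saving claimed for long sequences.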