A Unified Perspective on Efficient Attention: Generalized Memory and Kernel Function Selection in Transformers
Abstract
This article introduces a new theoretical perspective on Scaled Dot-Product Attention (SDPA) in transformers by connecting it to distributed memory theory. We propose that SDPA is an extension of distributed memory and reframe it as a Generalized Memory Model, a mechanism for learning multi-way associations that expands upon classical frameworks. Our experimental findings validate this perspective and yield practical methods for building more efficient attention mechanisms. We demonstrate that model convergence is maintained even when the query and key vectors are identical, a modification that halves the memory required to store them and offers a significant efficiency gain for long sequences. Furthermore, our comprehensive kernel analysis shows that all tested kernels, including a simple linear kernel, provide a path to convergence, establishing a strong baseline for attention approximation. The Radial Basis Function (RBF) kernel offers marginal improvements, and we also introduce several kernels that converge successfully when their attention weights are normalized. Collectively, this work provides both a unifying theoretical lens for transformer attention and a practical guide to kernel selection for developing more robust and efficient models.
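To make the two efficiency ideas in the abstract concrete, the following is a minimal illustrative sketch (not the authors' implementation): a single-head attention module in which the query and key projections share one weight matrix, so q == k for every token, and in which the softmax weighting can be swapped for a linear or RBF kernel with row-normalized weights. The module name TiedQKAttention, the unit RBF bandwidth, and the sum-to-one normalization scheme are assumptions made for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedQKAttention(nn.Module):
    """Single-head attention with a shared query/key projection.

    Hypothetical sketch: one linear layer produces both queries and keys,
    halving the parameters and activations needed for those projections.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.qk_proj = nn.Linear(d_model, d_model)  # shared Q/K projection (q == k)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, kernel: str = "softmax") -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        qk = self.qk_proj(x)                              # shared representation for Q and K
        v = self.v_proj(x)
        scores = qk @ qk.transpose(-2, -1) * self.scale   # pairwise similarities

        if kernel == "softmax":
            # Standard SDPA weighting.
            weights = F.softmax(scores, dim=-1)
        elif kernel == "linear":
            # Linear kernel: raw dot products, normalized so each row sums to one.
            # Normalization choice is an assumption; negative scores are left as-is.
            weights = scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        elif kernel == "rbf":
            # RBF kernel on pairwise squared distances, bandwidth of 1 assumed.
            sq_dists = torch.cdist(qk, qk) ** 2
            weights = torch.exp(-sq_dists)
            weights = weights / weights.sum(dim=-1, keepdim=True)
        else:
            raise ValueError(f"unknown kernel: {kernel}")

        return weights @ v

# Usage example on a toy batch:
# attn = TiedQKAttention(d_model=64)
# out = attn(torch.randn(2, 16, 64), kernel="rbf")   # (2, 16, 64)

Because q and k come from the same projection, only one (seq_len x d_model) activation has to be cached for the similarity computation, which is the source of the memory saving claimed for long sequences.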