Theoretical Analysis of Positional Encodings in Transformer Models: Impact on Expressiveness and Generalization
Abstract
Positional encodings are a core component of transformer-based architectures, enabling these models to process sequential data without recurrence. Despite their critical role, the theoretical properties of the various positional encoding schemes—including sinusoidal, learned, relative, and recent bias-based methods such as Attention with Linear Biases (ALiBi)—remain poorly understood. In this paper, we present a comprehensive theoretical framework for analyzing how different positional encodings affect a transformer’s expressiveness, generalization ability, and extrapolation to sequences longer than those seen during training. We formalize expressiveness in terms of function approximation classes, derive generalization bounds for different encoding schemes via Rademacher complexity analysis, and propose several novel positional encoding methods based on orthogonal function families (e.g., wavelets, Legendre polynomials) and information-theoretic criteria. We also characterize the extrapolation capacity of existing and proposed encodings, extending ALiBi’s biasing approach to a more unified theoretical setting. Our lightweight experimental evaluation on synthetic sequence-to-sequence tasks validates key theoretical predictions, showing that encoding schemes grounded in orthogonal transforms can outperform standard sinusoidal encodings in both generalization and extrapolation. This work fills an important gap in transformer theory, offering new insights that can guide design choices in natural language processing, computer vision, and other domains where transformers dominate.
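For concreteness, the sketch below illustrates the two standard baselines the abstract contrasts: sinusoidal positional encodings and ALiBi's linear attention biases. This is a minimal NumPy illustration, not the paper's proposed orthogonal-function encodings; the function names are ours, and the ALiBi slope formula assumes a power-of-two head count.

```python
# Minimal sketch of two baseline schemes discussed in the abstract.
# Assumptions: even d_model for the sinusoidal encoding, and a
# power-of-two n_heads for the ALiBi slope schedule.
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return the (seq_len, d_model) sinusoidal positional encoding matrix."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Return the (n_heads, seq_len, seq_len) additive attention bias used by ALiBi."""
    # Head-specific slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    # rel[i, j] = j - i is non-positive for keys at or before the query position.
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    # The penalty grows linearly with distance; future positions (rel > 0)
    # are left at zero here and handled by the causal mask in practice.
    return slopes[:, None, None] * np.minimum(rel, 0)
```

In use, `sinusoidal_encoding` is added to the token embeddings, whereas `alibi_bias` is added to the pre-softmax attention logits with no learned position parameters, which is what allows ALiBi-style models to be applied to sequences longer than those seen during training.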