Theoretical Analysis of Positional Encodings in Transformer Models: Impact on Expressiveness and Generalization
Abstract
Positional encodings are a core component of transformer-based architectures, enabling these models to process sequential data without recurrence. Despite their critical role, the theoretical properties of the various positional encoding schemes—including sinusoidal, learned, relative, and recent bias-based methods such as Attention with Linear Biases (ALiBi)—remain poorly understood. In this paper, we present a comprehensive theoretical framework for analyzing how different positional encodings affect a transformer’s expressiveness, generalization ability, and extrapolation to sequences longer than those seen during training. We formalize expressiveness in terms of function approximation classes, derive generalization bounds for different encoding schemes via Rademacher complexity analysis, and propose several novel positional encoding methods based on orthogonal function families (e.g., wavelets, Legendre polynomials) and information-theoretic criteria. We also characterize the extrapolation capacity of existing and proposed encodings, extending ALiBi’s biasing approach to a more unified theoretical setting. Our lightweight experimental evaluation on synthetic sequence-to-sequence tasks validates key theoretical predictions, showing that encoding schemes grounded in orthogonal transforms can outperform standard sinusoidal encodings in both generalization and extrapolation. This work fills an important gap in transformer theory, offering new insights that can guide design choices in natural language processing, computer vision, and other domains where transformers dominate.
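For concreteness, the sketch below illustrates the two standard baselines the abstract contrasts: sinusoidal positional encodings and ALiBi's linear attention biases. This is a minimal NumPy illustration, not the paper's proposed orthogonal-function encodings; the function names are ours, and the ALiBi slope formula assumes a power-of-two head count.

```python
# Minimal sketch of two baseline schemes discussed in the abstract.
# Assumptions: even d_model for the sinusoidal encoding, and a
# power-of-two n_heads for the ALiBi slope schedule.
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return the (seq_len, d_model) sinusoidal positional encoding matrix."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Return the (n_heads, seq_len, seq_len) additive attention bias used by ALiBi."""
    # Head-specific slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    # rel[i, j] = j - i is non-positive for keys at or before the query position.
    rel = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    # The penalty grows linearly with distance; future positions (rel > 0)
    # are left at zero here and handled by the causal mask in practice.
    return slopes[:, None, None] * np.minimum(rel, 0)
```

In use, `sinusoidal_encoding` is added to the token embeddings, whereas `alibi_bias` is added to the pre-softmax attention logits with no learned position parameters, which is what allows ALiBi-style models to be applied to sequences longer than those seen during training.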