From Transformer to Transponder: Introducing Contextual Modulation Training for Residual Learning in LLMs
Abstract
Transformers are the backbone of state-of-the-art systems across language, vision, and multimodal learning tasks, yet the scaling applied to their functional blocks (self-attention and feed-forward networks) is typically constant across inputs and depth. This static design precludes context-sensitive regulation of information flow through the residual pathways. We introduce the \emph{contextual modulator}: a lightweight, input-aware mechanism that scales the outputs of linear sublayers within a block, or the output of the entire block, at token- and channel-level granularity. The modulator is implemented with compact parametric functions and adds negligible parameter overhead. Building on this idea, we propose Transponder, which integrates contextual modulators throughout Transformer blocks to endow residual architectures with fine-grained, input-adaptive control. Transponder outperforms six other scaling and normalization methods across LLaMA backbones ranging from 60M to 250M parameters, yielding consistent perplexity reductions with $<1\%$ additional parameters. Analysis reveals depth-, module-, and token-specific scaling patterns, indicating that the learned modulators act as input-adaptive regulators of residual information flow. Transponder thus offers a simple, general mechanism for augmenting Transformer-based models with context-sensitive modulation, delivering robust and significant performance improvements without substantial architectural changes.
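To make the idea concrete, below is a minimal sketch of a token- and channel-level contextual modulator as described in the abstract. The specific parametric form (a low-rank bottleneck MLP with a sigmoid gate), the bottleneck width, and the zero-initialization are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch: an input-aware modulator that rescales a sublayer (or block)
# output at token- and channel-level granularity before it enters the residual
# stream. The bottleneck MLP + sigmoid gate shown here is an assumed design.
import torch
import torch.nn as nn


class ContextualModulator(nn.Module):
    """Produce per-token, per-channel scales from the sublayer output itself."""

    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        # Low-rank bottleneck keeps the added parameter count small
        # (roughly 2 * d_model * bottleneck weights per modulator).
        self.proj = nn.Sequential(
            nn.Linear(d_model, bottleneck),
            nn.SiLU(),
            nn.Linear(bottleneck, d_model),
        )
        # Zero-init the last layer so the scale starts at exactly 1 and the
        # model begins training as an unmodulated Transformer (an assumption).
        nn.init.zeros_(self.proj[-1].weight)
        nn.init.zeros_(self.proj[-1].bias)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, seq_len, d_model) sublayer output.
        # Scale lies in (0, 2), equal to 1 at initialization.
        scale = 2.0 * torch.sigmoid(self.proj(y))
        return scale * y


# Schematic use inside a pre-norm Transformer block:
#   x = x + mod_attn(self_attn(norm1(x)))
#   x = x + mod_ffn(ffn(norm2(x)))
```

In this hypothetical configuration, each modulator adds on the order of 2 * d_model * 16 parameters per sublayer, which stays well under the 1% overhead cited in the abstract for typical model widths.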