CacheFormer: High Attention-Based Segment Caching

Abstract

Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches, such as Linformer, Longformer, Performer, and structured state space models (SSMs), have not fully resolved this problem. All of these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the loss in quality caused by compressing the long context. Inspired by the cache memory mechanism in computer architecture, we improve upon the work presented in the Long-Short Transformer (Transformer-LS). Our enhancements include augmenting the architecture with attention over dynamically retrieved uncompressed segments that exhibit high attention at the compressed level. By analogy with cache memory, on a cache miss not only the needed data but also the nearby following data is fetched from memory; similarly, when high attention occurs at the compressed level, we also retrieve the nearby segments in uncompressed form. We further enhance the Transformer-LS by augmenting the long attention with compressed overlapping segments, reducing the loss in quality caused by segment fragmentation in sequences with long context. Our perplexity results indicate significant improvements over Transformer-LS and other state-of-the-art language models.
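The segment-selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the score array, and the neighbor-prefetch parameter are all illustrative assumptions. It shows the core idea: pick the compressed segments with the highest attention scores, then, cache-style, also fetch each selected segment's immediate successors in uncompressed form.

```python
import numpy as np

def top_segments_with_neighbors(seg_scores, k, num_neighbors):
    """Illustrative sketch (not the paper's code): select the k compressed
    segments with the highest attention scores, then prefetch the following
    `num_neighbors` segments of each, mimicking cache-line prefetching."""
    num_segments = len(seg_scores)
    # Indices of the k highest-scoring compressed segments.
    top = np.argsort(seg_scores)[::-1][:k]
    selected = set()
    for s in top:
        # Fetch the high-attention segment plus its nearby following segments.
        for offset in range(num_neighbors + 1):
            idx = s + offset
            if idx < num_segments:
                selected.add(int(idx))
    # These segment indices would then be attended to in uncompressed form.
    return sorted(selected)
```

For example, with per-segment attention scores `[0.1, 0.9, 0.2, 0.8, 0.05]`, `k=2`, and one prefetched neighbor, the function selects segments 1 and 3 plus their successors, i.e. `[1, 2, 3, 4]`.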
