CacheFormer: High Attention-Based Segment Caching

Abstract

Efficiently handling long contexts in transformer-based language models with low perplexity is an active area of research. Numerous recent approaches, such as Linformer, Longformer, Performer, and structured state space models (SSMs), have not fully resolved this problem. All of these models strive to reduce the quadratic time complexity of the attention mechanism while minimizing the loss in quality caused by compressing the long context. Inspired by the cache memory mechanism in computer architecture, we improve upon the work presented in the Long-Short Transformer (Transformer-LS). Our enhancements include augmenting the architecture with attention over dynamically retrieved uncompressed segments that exhibit high attention at the compressed level. By analogy with cache memory, on a cache miss not only the needed data but also the nearby following data is fetched from memory; similarly, when high attention occurs at the compressed level, we also retrieve the nearby segments in uncompressed form. We further enhance the Transformer-LS by augmenting the long attention with compressed overlapping segments, reducing the loss in quality caused by segment fragmentation in sequences with long context. Our perplexity results indicate significant improvements over Transformer-LS and other state-of-the-art language models.
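The segment-selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, the score array, and the neighbor-prefetch parameter are all illustrative assumptions. It shows the core idea: pick the compressed segments with the highest attention scores, then, cache-style, also fetch each selected segment's immediate successors in uncompressed form.

```python
import numpy as np

def top_segments_with_neighbors(seg_scores, k, num_neighbors):
    """Illustrative sketch (not the paper's code): select the k compressed
    segments with the highest attention scores, then prefetch the following
    `num_neighbors` segments of each, mimicking cache-line prefetching."""
    num_segments = len(seg_scores)
    # Indices of the k highest-scoring compressed segments.
    top = np.argsort(seg_scores)[::-1][:k]
    selected = set()
    for s in top:
        # Fetch the high-attention segment plus its nearby following segments.
        for offset in range(num_neighbors + 1):
            idx = s + offset
            if idx < num_segments:
                selected.add(int(idx))
    # These segment indices would then be attended to in uncompressed form.
    return sorted(selected)
```

For example, with per-segment attention scores `[0.1, 0.9, 0.2, 0.8, 0.05]`, `k=2`, and one prefetched neighbor, the function selects segments 1 and 3 plus their successors, i.e. `[1, 2, 3, 4]`.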
