DAA: Dynamic attention allocation improves large-scale model reasoning



Abstract

The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and ChatGPT. Despite their superior accuracy, LLMs present unique challenges in practical inference because of their compute- and memory-intensive nature. Multi-head attention, one of the key components of LLMs, can account for over 50% of their memory and compute requirements. We observe a high degree of redundancy among heads in which tokens they attend to across different sequences. Based on this finding, we propose Dynamic Head Attention Allocation (DAA). DAA combines, in two stages, attention heads whose attention patterns are highly correlated for self-attention within chunks and across layers, combining both local and global attention and thus reducing both memory and compute.
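The abstract's key observation is that different heads often attend to largely the same tokens. A minimal sketch of how such redundancy could be measured and exploited is shown below; it is not the authors' implementation, and the function names, the cosine-similarity metric, and the 0.9 grouping threshold are illustrative assumptions.

```python
# Sketch: estimate head redundancy by comparing how similarly heads
# distribute attention over tokens, then greedily group correlated heads.
import torch

def head_attention_similarity(attn: torch.Tensor) -> torch.Tensor:
    """attn: [num_heads, seq_len, seq_len] attention probabilities for one
    sequence. Returns [num_heads, num_heads] cosine similarities between
    flattened per-head attention maps."""
    num_heads = attn.shape[0]
    flat = attn.reshape(num_heads, -1)
    flat = flat / flat.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return flat @ flat.T

def group_redundant_heads(sim: torch.Tensor, threshold: float = 0.9):
    """Greedy grouping (illustrative): heads whose attention maps are more
    similar than `threshold` to a group's first head join that group and
    could share one attention computation."""
    num_heads = sim.shape[0]
    groups, assigned = [], set()
    for h in range(num_heads):
        if h in assigned:
            continue
        members = [h] + [j for j in range(h + 1, num_heads)
                         if j not in assigned and sim[h, j] > threshold]
        assigned.update(members)
        groups.append(members)
    return groups

# Example with random attention maps (softmax over the key dimension)
attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
print(group_redundant_heads(head_attention_similarity(attn)))
```

In a scheme like the one the abstract describes, heads within a group would share attention computed over local chunks, while a smaller set of heads retains global attention, which is where the memory and compute savings would come from.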
