DAA: Dynamic Attention Allocation improves large-scale model reasoning
Abstract
The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and ChatGPT. Despite their superior accuracy, LLMs present unique challenges in practical inference owing to their compute- and memory-intensive nature. Multi-Head Attention is one of the key components of LLMs and can account for over 50% of their memory and compute requirements. We observe a high degree of redundancy among heads in which tokens they attend to across different sequences. Based on this finding, we propose Dynamic Head Attention Allocation (DAA). DAA applies two-stage self-attention within chunks and across layers, combining attention heads that exhibit a high degree of correlation; this unifies local and global attention, reducing both memory and compute.
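To illustrate the redundancy observation underlying DAA, the following is a minimal sketch of how inter-head redundancy might be measured: it compares per-head attention maps by cosine similarity and greedily groups heads whose maps are nearly identical. The function names, the similarity threshold, and the greedy grouping heuristic are illustrative assumptions, not the paper's actual allocation algorithm.

```python
import torch

def head_attention_correlation(attn: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between per-head attention maps.

    attn: [num_heads, seq_len, seq_len] attention weights from one layer.
    Returns a [num_heads, num_heads] similarity matrix.
    """
    flat = attn.flatten(start_dim=1)                       # [H, L*L]
    flat = flat / flat.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return flat @ flat.T                                   # cosine similarity

def group_redundant_heads(sim: torch.Tensor, threshold: float = 0.9):
    """Greedily group heads whose attention maps exceed a similarity threshold.

    NOTE: a hypothetical grouping heuristic for illustration only.
    """
    num_heads = sim.size(0)
    assigned = [False] * num_heads
    groups = []
    for h in range(num_heads):
        if assigned[h]:
            continue
        group, assigned[h] = [h], True
        for other in range(h + 1, num_heads):
            if not assigned[other] and sim[h, other] >= threshold:
                group.append(other)
                assigned[other] = True
        groups.append(group)
    return groups

# Toy usage: 8 heads attending over a 16-token sequence.
attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
groups = group_redundant_heads(head_attention_correlation(attn))
print(groups)  # lists of heads whose attention patterns are largely redundant
```

Heads that fall into the same group attend to essentially the same tokens, so their computation and memory could in principle be shared, which is the intuition DAA builds on.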