WLAM Attention: Plug-and-Play Wavelet Transform Linear Attention
Abstract
Linear attention has gained popularity in recent years due to its lower computational complexity compared to Softmax attention, but its relatively weaker performance has limited its widespread adoption. To address this issue, we propose a plug-and-play module, the Wavelet-Enhanced Linear Attention Mechanism (WLAM), which integrates the discrete wavelet transform (DWT) with linear attention, strengthening the model's expression of global contextual information while improving its capture of local features. First, we introduce the DWT into the attention mechanism to decompose the input features: the original input features generate the query Q, the low-frequency coefficients generate the key K, and the high-frequency coefficients are convolved to produce the value V. This embeds global and local information into different components of the attention mechanism, enhancing the model's perception of both fine detail and overall structure. Second, we apply multi-scale convolution to the high-frequency wavelet coefficients and incorporate a Squeeze-and-Excitation (SE) module to enhance feature selectivity; the inverse discrete wavelet transform (IDWT) then reintegrates the multi-scale processed information into the spatial domain, addressing the limitations of linear attention in handling multi-scale and local information. Finally, inspired by certain structures of the Mamba network, we introduce a forget gate and an improved block design into the linear attention framework, inheriting the core advantages of the Mamba architecture. Following a similar rationale, we leverage the lossless downsampling property of wavelet transforms to combine the downsampling module with the attention module, yielding the Wavelet Downsampling Attention (WDSA) module.
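The DWT-based Q/K/V construction described above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single-level Haar DWT (the paper does not specify the wavelet), a plain 3x3 convolution on the stacked high-frequency bands in place of the multi-scale convolution and SE module, and a ReLU-feature-map linear attention without the forget gate; the class name `WLAMSketch` and all layer choices are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(x):
    """Single-level 2D Haar DWT on x of shape (B, C, H, W), H and W even.
    Returns the low-frequency band LL and the three high-frequency bands
    (LH, HL, HH) stacked along the channel dimension."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, torch.cat([lh, hl, hh], dim=1)

class WLAMSketch(nn.Module):
    """Hypothetical sketch of the WLAM idea: Q from the original features,
    K from the low-frequency (LL) band, V from convolved high-frequency
    bands, combined by a linear attention (no O(N^2) score matrix)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        # Stand-in for the paper's multi-scale convolution + SE module.
        self.v_conv = nn.Conv2d(3 * dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        ll, high = haar_dwt(x)                             # (B,C,H/2,W/2), (B,3C,H/2,W/2)
        q = self.q(x.flatten(2).transpose(1, 2))           # (B, HW, C)
        k = self.k(ll.flatten(2).transpose(1, 2))          # (B, HW/4, C)
        v = self.v_conv(high).flatten(2).transpose(1, 2)   # (B, HW/4, C)
        # Linear attention: phi(Q) (phi(K)^T V) with a ReLU feature map,
        # so cost is linear in sequence length instead of quadratic.
        q, k = F.relu(q) + 1e-6, F.relu(k) + 1e-6
        kv = torch.einsum('bnc,bnd->bcd', k, v)            # (B, C, C)
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        out = torch.einsum('bnc,bcd->bnd', q, kv) * z      # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)
```

Note that because K and V come from the half-resolution wavelet bands, the key/value sequence is four times shorter than the query sequence, which is where the combination with downsampling (the WDSA idea) comes from.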
This integration reduces network size and computational load while mitigating the information loss associated with downsampling. We apply WLAM to classical networks such as PVT, Swin, and CSwin, achieving significant performance improvements on image classification tasks. Furthermore, combining wavelet linear attention with the WDSA module, we construct WDLMFormer, which achieves an accuracy of 84.2% on the ImageNet-1K dataset.