LGMFuse: A Multi-modal Image Fusion Method Based on Local and Global Mamba
Abstract
To address the challenges in infrared-visible and polarization-visible image fusion, namely insufficient feature extraction, the difficulty of capturing global dependencies and local spatial information simultaneously, and the lack of multi-task adaptability in existing models optimized for a single task, this paper proposes an end-to-end solution named LGMFuse. The core innovation of LGMFuse is its Local and Global Mamba (LGM) module, which significantly enhances multi-directional perception through an eight-directional scanning mechanism. The LGM module comprises two parallel branches, global four-directional scanning and local multi-scale four-directional scanning, designed to capture global dependencies and extract local spatial features, respectively. In the encoding stage, LGMFuse employs a three-stage feature extraction architecture that progressively extracts multi-scale multimodal features via the Local and Global Mamba Encode Block (LGME), acquiring higher-level semantic information as the network deepens. In the fusion stage, the Local and Global Mamba Fusion Block (LGMF) performs deep fusion of same-scale feature maps from different modalities, preserving complementary characteristics while suppressing redundancy and noise to ensure the precision of the fused features. Furthermore, LGMFuse introduces a collaborative encoding mechanism in which the LGME and LGMF modules interact closely across multiple stages and scales during feature extraction and fusion, markedly improving information integration efficiency and enhancing the model's robustness and generalization capability. Experimental results on three public datasets demonstrate that LGMFuse achieves state-of-the-art multimodal fusion quality and object detection performance.
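To make the two-branch, eight-directional scanning idea concrete, the following is a minimal PyTorch sketch. It assumes the four global scan orders are row-major forward/backward and column-major forward/backward, uses a fixed-size window split to approximate the local multi-scale branch, and substitutes a shared depthwise 1-D convolution for the Mamba state-space layer; all names (`LGMBlock`, `four_direction_sequences`, the `window` size) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of local/global four-directional scanning (illustrative only).
import torch
import torch.nn as nn


def four_direction_sequences(x):
    """Flatten a (B, C, H, W) map into four 1-D scan orders:
    row-major forward/backward and column-major forward/backward."""
    row = x.flatten(2)                       # (B, C, H*W), row-major
    col = x.transpose(2, 3).flatten(2)       # (B, C, H*W), column-major
    return [row, row.flip(-1), col, col.flip(-1)]


def merge_four_directions(seqs, h, w):
    """Undo the four scan orders and average them back into a (B, C, H, W) map."""
    row, row_r, col, col_r = seqs
    b, c, _ = row.shape
    maps = [
        row.view(b, c, h, w),
        row_r.flip(-1).view(b, c, h, w),
        col.view(b, c, w, h).transpose(2, 3),
        col_r.flip(-1).view(b, c, w, h).transpose(2, 3),
    ]
    return torch.stack(maps, dim=0).mean(dim=0)


class LGMBlock(nn.Module):
    """Sketch of an LGM-style block: a global branch scans the whole map in
    four directions; a local branch applies the same scans inside small
    windows. The depthwise 1-D conv is only a stand-in for the Mamba layer."""

    def __init__(self, channels, window=8):
        super().__init__()
        self.window = window
        self.seq_model = nn.Conv1d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def _scan(self, x):
        h, w = x.shape[-2:]
        outs = [self.seq_model(s) for s in four_direction_sequences(x)]
        return merge_four_directions(outs, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        global_feat = self._scan(x)                   # global four-directional branch

        # Local branch: split into non-overlapping windows and scan each window.
        k = self.window
        windows = (x.unfold(2, k, k).unfold(3, k, k)  # (B, C, H//k, W//k, k, k)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(-1, c, k, k))
        local = self._scan(windows)
        local_feat = (local.view(b, h // k, w // k, c, k, k)
                           .permute(0, 3, 1, 4, 2, 5)
                           .reshape(b, c, h, w))

        return x + self.fuse(torch.cat([global_feat, local_feat], dim=1))


if __name__ == "__main__":
    feat = torch.randn(2, 32, 64, 64)
    print(LGMBlock(32)(feat).shape)   # torch.Size([2, 32, 64, 64])
```

In this sketch the global branch models long-range dependencies over the full feature map while the windowed branch preserves local spatial detail, and a 1x1 convolution fuses the two before the residual connection; the paper's LGME and LGMF modules would build on blocks of this kind across encoder stages and modalities.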