CMAFNet: Efficient Cross-Modal Alignment and Fusion for Real-Time RGB–Infrared Object Detection in Autonomous Driving

Abstract

RGB–infrared (IR) fusion is an effective way to improve object detection robustness for automotive perception under low-light and adverse-weather conditions. Yet, practical multi-modal detectors still face three issues: imperfect cross-modal alignment, inefficient long-range interaction, and unstable query initialization when modalities exhibit inconsistent evidence. This paper presents CMAFNet, a deployment-oriented cross-modal alignment and fusion network with three key designs. (1) A Dynamic Receptive Backbone (DRB) extracts multi-scale features with adaptive receptive fields for both modalities. (2) A Channel-Split Mamba Block (CSM-Block) models long-range cross-modal dependencies using selective state-space modeling with linear complexity in token length, enabling an efficient accuracy–latency trade-off. (3) A Global Multi-modal Interaction Network (GMIN) performs fine-grained alignment and adaptive fusion via dual-branch cross-attention guided by global average/max pooling. In addition, an uncertainty-minimal query selection strategy and a separable dynamic decoder further enhance detection stability and efficiency. Experiments on M3FD and FLIR-Aligned show that CMAFNet achieves 83.9% mAP50 and 84.2% mAP50, respectively, while maintaining competitive inference efficiency, supporting real-time automotive deployment on compute-constrained platforms.
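To make the fusion idea more concrete, below is a minimal, self-contained sketch (not the authors' released code) of a GMIN-style fusion block: two cross-attention branches (RGB queries attending to IR features and vice versa) whose outputs are re-weighted by a channel gate derived from global average and max pooling. All module names, tensor shapes, and the specific gating formula are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a GMIN-style dual-branch cross-attention fusion block.
# Assumptions: single-head attention, sigmoid channel gate from concatenated
# global avg/max pooling statistics, 1x1 conv for the final fusion.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-head cross-attention over flattened spatial tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, ctx):
        # x, ctx: (B, N, C); queries come from x, keys/values from ctx
        q, k, v = self.q(x), self.k(ctx), self.v(ctx)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


class GMINFusionSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.rgb_to_ir = CrossAttention(dim)   # RGB queries attend to IR
        self.ir_to_rgb = CrossAttention(dim)   # IR queries attend to RGB
        # Channel gate driven by global average + max pooling of both modalities
        self.gate = nn.Sequential(nn.Linear(4 * dim, dim), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, rgb, ir):
        # rgb, ir: (B, C, H, W) feature maps from the two backbone streams
        b, c, h, w = rgb.shape
        rgb_t = rgb.flatten(2).transpose(1, 2)          # (B, HW, C)
        ir_t = ir.flatten(2).transpose(1, 2)

        fused_rgb = self.rgb_to_ir(rgb_t, ir_t)          # RGB enriched by IR
        fused_ir = self.ir_to_rgb(ir_t, rgb_t)           # IR enriched by RGB

        # Global average/max pooling statistics guide the channel gate
        stats = torch.cat([
            rgb.mean(dim=(2, 3)), rgb.amax(dim=(2, 3)),
            ir.mean(dim=(2, 3)), ir.amax(dim=(2, 3)),
        ], dim=1)                                        # (B, 4C)
        g = self.gate(stats).unsqueeze(1)                # (B, 1, C)

        # Gate balances the two branches, then reshape and fuse with a 1x1 conv
        fused_rgb = (g * fused_rgb).transpose(1, 2).reshape(b, c, h, w)
        fused_ir = ((1 - g) * fused_ir).transpose(1, 2).reshape(b, c, h, w)
        return self.proj(torch.cat([fused_rgb, fused_ir], dim=1))


if __name__ == "__main__":
    m = GMINFusionSketch(dim=64)
    out = m(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In practice such a block would be applied at each backbone scale before the detection head; the sketch omits the Mamba-based long-range interaction (CSM-Block) and the uncertainty-minimal query selection, which operate on the fused features downstream.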
