CMAFNet: Efficient Cross-Modal Alignment and Fusion for Real-Time RGB–Infrared Object Detection in Autonomous Driving

Abstract

RGB–infrared (IR) fusion is an effective way to improve object detection robustness for automotive perception under low-light and adverse-weather conditions. Yet, practical multi-modal detectors still face three issues: imperfect cross-modal alignment, inefficient long-range interaction, and unstable query initialization when modalities exhibit inconsistent evidence. This paper presents CMAFNet, a deployment-oriented cross-modal alignment and fusion network with three key designs. (1) A Dynamic Receptive Backbone (DRB) extracts multi-scale features with adaptive receptive fields for both modalities. (2) A Channel-Split Mamba Block (CSM-Block) models long-range cross-modal dependencies using selective state-space modeling with linear complexity in token length, enabling an efficient accuracy–latency trade-off. (3) A Global Multi-modal Interaction Network (GMIN) performs fine-grained alignment and adaptive fusion via dual-branch cross-attention guided by global average/max pooling. In addition, an uncertainty-minimal query selection strategy and a separable dynamic decoder further enhance detection stability and efficiency. Experiments on M3FD and FLIR-Aligned show that CMAFNet achieves 83.9% mAP50 and 84.2% mAP50, respectively, while maintaining competitive inference efficiency, supporting real-time automotive deployment on compute-constrained platforms.
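To make the fusion idea more concrete, below is a minimal, self-contained sketch (not the authors' released code) of a GMIN-style fusion block: two cross-attention branches (RGB queries attending to IR features and vice versa) whose outputs are re-weighted by a channel gate derived from global average and max pooling. All module names, tensor shapes, and the specific gating formula are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a GMIN-style dual-branch cross-attention fusion block.
# Assumptions: single-head attention, sigmoid channel gate from concatenated
# global avg/max pooling statistics, 1x1 conv for the final fusion.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-head cross-attention over flattened spatial tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, ctx):
        # x, ctx: (B, N, C); queries come from x, keys/values from ctx
        q, k, v = self.q(x), self.k(ctx), self.v(ctx)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


class GMINFusionSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.rgb_to_ir = CrossAttention(dim)   # RGB queries attend to IR
        self.ir_to_rgb = CrossAttention(dim)   # IR queries attend to RGB
        # Channel gate driven by global average + max pooling of both modalities
        self.gate = nn.Sequential(nn.Linear(4 * dim, dim), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, rgb, ir):
        # rgb, ir: (B, C, H, W) feature maps from the two backbone streams
        b, c, h, w = rgb.shape
        rgb_t = rgb.flatten(2).transpose(1, 2)          # (B, HW, C)
        ir_t = ir.flatten(2).transpose(1, 2)

        fused_rgb = self.rgb_to_ir(rgb_t, ir_t)          # RGB enriched by IR
        fused_ir = self.ir_to_rgb(ir_t, rgb_t)           # IR enriched by RGB

        # Global average/max pooling statistics guide the channel gate
        stats = torch.cat([
            rgb.mean(dim=(2, 3)), rgb.amax(dim=(2, 3)),
            ir.mean(dim=(2, 3)), ir.amax(dim=(2, 3)),
        ], dim=1)                                        # (B, 4C)
        g = self.gate(stats).unsqueeze(1)                # (B, 1, C)

        # Gate balances the two branches, then reshape and fuse with a 1x1 conv
        fused_rgb = (g * fused_rgb).transpose(1, 2).reshape(b, c, h, w)
        fused_ir = ((1 - g) * fused_ir).transpose(1, 2).reshape(b, c, h, w)
        return self.proj(torch.cat([fused_rgb, fused_ir], dim=1))


if __name__ == "__main__":
    m = GMINFusionSketch(dim=64)
    out = m(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In practice such a block would be applied at each backbone scale before the detection head; the sketch omits the Mamba-based long-range interaction (CSM-Block) and the uncertainty-minimal query selection, which operate on the fused features downstream.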
