HCTNet: Hybrid CNN--Mamba Network for Real-Time Semantic Segmentation in Urban Traffic Scenes
Abstract
Real-time semantic segmentation in urban traffic scenes must preserve fine structures while capturing long-range context under strict latency constraints. We propose HCTNet, a hybrid CNN--Mamba framework that performs single-branch CNN inference and leverages a training-only Mamba auxiliary branch to inject global context during optimization. The method introduces a lightweight Convolutional State Module (CSM) to enlarge the effective receptive field within the CNN backbone and a Feature Alignment Module (FAM) to align multi-scale representations from the CNN and Mamba streams via spatial/channel projections and gated fusion. A single shared decoder is used for all streams during training to enforce a common prediction space; at test time only the CNN path with the shared decoder is executed to retain real-time efficiency. On Cityscapes, HCTNet attains 81.0\% mean Intersection-over-Union (mIoU) at 60.5 frames per second (FPS) and reaches up to 108.9 FPS with an optimized inference setting; under a reduced input scale it achieves 80.3\% mIoU at 98.6 FPS. On ApolloScape (mapped to the Cityscapes-19 taxonomy), HCTNet obtains 73.8\% mIoU. Qualitative results show sharper boundaries and more coherent predictions for small and distant objects. Ablation studies indicate that receptive-field enhancement from CSM, training-time global guidance from the Mamba branch, and multi-scale alignment through FAM jointly account for the gains, while the shared decoder regularizes predictions without increasing inference cost.
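To make the alignment-and-fusion idea concrete, the following PyTorch sketch shows one plausible reading of a FAM-style block as described above: both streams are projected to a common channel width, the Mamba features are resampled to the CNN resolution, and a learned gate mixes the two. The class name `FeatureAlignment`, the 1x1-convolution projections, the sigmoid gate, and all channel widths are illustrative assumptions, not the authors' implementation; per the abstract, this fusion would only be active during training, with the CNN path and shared decoder run alone at inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignment(nn.Module):
    """Hypothetical FAM-style block (assumed design, not the paper's code):
    spatial/channel alignment of CNN and Mamba features followed by gated fusion."""

    def __init__(self, cnn_channels: int, mamba_channels: int, fused_channels: int):
        super().__init__()
        # Channel projections (1x1 convs) bring both streams to a common width.
        self.proj_cnn = nn.Conv2d(cnn_channels, fused_channels, kernel_size=1)
        self.proj_mamba = nn.Conv2d(mamba_channels, fused_channels, kernel_size=1)
        # Gate predicts per-pixel, per-channel mixing weights from both streams.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * fused_channels, fused_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_cnn: torch.Tensor, f_mamba: torch.Tensor) -> torch.Tensor:
        # Spatial alignment: resize the Mamba features to the CNN resolution.
        f_mamba = F.interpolate(
            f_mamba, size=f_cnn.shape[-2:], mode="bilinear", align_corners=False
        )
        c = self.proj_cnn(f_cnn)
        m = self.proj_mamba(f_mamba)
        # Gated fusion: convex combination of the two aligned streams.
        g = self.gate(torch.cat([c, m], dim=1))
        return g * c + (1.0 - g) * m


if __name__ == "__main__":
    fam = FeatureAlignment(cnn_channels=128, mamba_channels=256, fused_channels=128)
    f_cnn = torch.randn(1, 128, 64, 128)   # CNN-stream feature map
    f_mamba = torch.randn(1, 256, 32, 64)  # coarser Mamba-stream feature map
    fused = fam(f_cnn, f_mamba)            # (1, 128, 64, 128), fed to the shared decoder during training
    print(fused.shape)
```

In this reading, dropping the Mamba branch and the gate at test time leaves only the CNN projections on the inference path, which is consistent with the abstract's claim that the auxiliary branch adds no inference cost.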