HCTNet: Hybrid CNN–Mamba Network for Real-Time Semantic Segmentation in Urban Traffic Scenes


Abstract

Real-time semantic segmentation in urban traffic scenes must preserve fine structures while capturing long-range context under strict latency constraints. We propose HCTNet, a hybrid CNN–Mamba framework that performs single-branch CNN inference and leverages a training-only Mamba auxiliary branch to inject global context during optimization. The method introduces a lightweight Convolutional State Module (CSM) to enlarge the effective receptive field within the CNN backbone and a Feature Alignment Module (FAM) to align multi-scale representations from the CNN and Mamba streams via spatial/channel projections and gated fusion. A single shared decoder is used for all streams during training to enforce a common prediction space; at test time only the CNN path with the shared decoder is executed to retain real-time efficiency. On Cityscapes, HCTNet attains 81.0% mean Intersection-over-Union (mIoU) at 60.5 frames per second (FPS) and reaches up to 108.9 FPS with an optimized inference setting; under a reduced input scale it achieves 80.3% mIoU at 98.6 FPS. On ApolloScape (mapped to the Cityscapes-19 taxonomy), HCTNet obtains 73.8% mIoU. Qualitative results show sharper boundaries and more coherent predictions for small and distant objects. Ablation studies indicate that receptive-field enhancement from CSM, training-time global guidance from the Mamba branch, and multi-scale alignment through FAM jointly account for the gains, while the shared decoder regularizes predictions without increasing inference cost.
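The abstract describes the FAM as aligning CNN and Mamba features via spatial/channel projections followed by gated fusion. As the paper's exact formulation is not given here, the sketch below is a minimal NumPy illustration of that general pattern, not the authors' implementation: the Mamba feature is projected onto the CNN channel space (a 1×1-convolution analogue), resized to the CNN resolution, and blended with a learned per-pixel sigmoid gate. All function and weight names (`fam_fuse`, `w_proj`, `w_gate`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fam_fuse(f_cnn, f_mamba, w_proj, w_gate):
    """Illustrative gated feature-alignment step (assumed form, not the paper's).

    f_cnn:   (C, H, W)    CNN-stream feature map
    f_mamba: (Cm, Hm, Wm) Mamba-stream feature map
    w_proj:  (C, Cm)      channel projection for the Mamba stream
    w_gate:  (1, 2*C)     per-pixel gate weights over concatenated features
    """
    C, H, W = f_cnn.shape
    # Channel projection: map Mamba channels onto the CNN channel space.
    f_m = np.einsum('cd,dhw->chw', w_proj, f_mamba)
    # Spatial alignment: nearest-neighbour resize to the CNN resolution.
    Hm, Wm = f_m.shape[1:]
    rows = np.arange(H) * Hm // H
    cols = np.arange(W) * Wm // W
    f_m = f_m[:, rows][:, :, cols]
    # Gated fusion: sigmoid gate decides, per pixel, how much global
    # (Mamba) versus local (CNN) evidence enters the fused feature.
    cat = np.concatenate([f_cnn, f_m], axis=0)          # (2C, H, W)
    g = sigmoid(np.einsum('oc,chw->ohw', w_gate, cat))  # (1, H, W)
    return g * f_cnn + (1.0 - g) * f_m
```

Because the gate is computed per pixel, the fused map can favor the CNN stream near fine boundaries and the global stream in large homogeneous regions, which matches the qualitative behavior the abstract reports.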
