Multiscale scene parsing network

Abstract

To address a core challenge of existing lightweight scene parsing networks, balancing the precision of multiscale feature representation against computational efficiency, this paper proposes MSPNet, a lightweight multiscale scene parsing network. The network adopts StarNet as its backbone to exploit its efficient low-to-high-dimensional feature transformation, and embeds the proposed Efficient Pixel Localization Attention (EPLA) module into the PSPNet architecture. Rather than simply stacking modules, EPLA integrates two synergistic submodules: ELA (Efficient Localization Attention) and PagFM (Pyramid Attention-Guided Feature Module). ELA uses a dynamic weight-allocation mechanism to localize features precisely at the pixel level while reducing attention computation overhead by 38%; PagFM builds a hierarchical pyramid fusion architecture that adaptively guides cross-scale feature integration and strengthens the representation of small targets. MSPNet additionally incorporates depthwise separable convolutions and channel reparameterization to further reduce model size. On the Pascal VOC2012 validation set, MSPNet achieves a mean Intersection over Union (mIoU) of 87.19%, a 1.79% improvement over PSPNet. With computational cost (9.7 GFLOPs with the StarNet-s4 backbone) and parameter count (7.4 M) comparable to the MobileNet series, MSPNet outperforms contemporary lightweight state-of-the-art models in both accuracy and efficiency, providing an effective solution for real-time semantic segmentation on resource-constrained mobile devices. The code for MSPNet is available at https://github.com/Eric-863/MSPnet.
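The abstract names two well-known efficiency ingredients: depthwise separable convolutions and strip-style localization attention in the spirit of ELA. The PyTorch sketch below illustrates both; it is a minimal illustration under assumed hyperparameters (3x3 depthwise kernel, 1D kernel size 7, GroupNorm), not the authors' EPLA or MSPNet implementation from the linked repository.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel 3x3 (depthwise)
    convolution followed by a 1x1 (pointwise) convolution, the standard
    factorization used by the MobileNet family to cut FLOPs and parameters."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


class StripLocalizationAttention(nn.Module):
    """Strip-pooling attention in the spirit of ELA: average-pool along each
    spatial axis, refine each 1D profile with a depthwise 1D convolution, and
    rescale the feature map with the resulting per-axis weights. The kernel
    size and normalization here are illustrative assumptions."""

    def __init__(self, channels: int, kernel_size: int = 7, groups: int = 16):
        super().__init__()
        assert channels % groups == 0, "GroupNorm needs channels % groups == 0"
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels,
                              bias=False)
        self.norm = nn.GroupNorm(groups, channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Pool over W to get a height profile, over H to get a width profile.
        attn_h = self.sigmoid(self.norm(self.conv(x.mean(dim=3))))  # (B, C, H)
        attn_w = self.sigmoid(self.norm(self.conv(x.mean(dim=2))))  # (B, C, W)
        return x * attn_h.view(b, c, h, 1) * attn_w.view(b, c, 1, w)


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    y = StripLocalizationAttention(128)(DepthwiseSeparableConv(64, 128)(x))
    print(y.shape)  # torch.Size([1, 128, 32, 32])
```

Sharing one 1D convolution across both spatial axes keeps the parameter count low, which is the general motivation behind attention of this kind; the actual EPLA design, including the 38% overhead reduction cited above, should be checked against the linked repository.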
