Weakly Supervised Semantic Segmentation Based on Subspace-Decoupled Representations and Cross-Layer CAM Structural Alignment
Abstract
Weakly Supervised Semantic Segmentation (WSSS) aims to learn pixel-level semantic predictions using only image-level annotations. However, due to the absence of precise spatial supervision, the generated Class Activation Maps (CAMs) often highlight only the most discriminative regions of objects, resulting in incomplete object coverage and unstable cross-layer semantic responses. To address these challenges, we propose a token-level contrastive-learning framework for WSSS that improves CAM localization quality by enhancing feature representations and enforcing cross-layer structural consistency. Specifically, we first introduce a multi-subspace token-level contrastive module, which decouples feature representations through a shared semantic backbone and multiple projection subspaces, thereby increasing the diversity and discriminability of the embedding space. Furthermore, we propose a cross-layer CAM structural alignment module that jointly constrains both the response intensity and the spatial structural relationships of CAMs across different Transformer layers, leading to more stable semantic localization and improved spatial consistency of object regions. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 benchmarks demonstrate that the proposed method consistently improves segmentation performance under an end-to-end training framework. In particular, it achieves 71.8% (val) and 72.3% (test) mIoU on VOC 2012, and 42.6% mIoU on the COCO 2014 validation set. Further ablation studies validate the effectiveness of each component. Overall, our method significantly enhances the completeness and structural stability of CAMs, providing an effective solution for representation learning and structural modeling in WSSS.
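To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: `multi_subspace_projections` maps shared-backbone token features through several projection subspaces (the embeddings that a token-level contrastive loss would compare), and `cam_alignment_loss` pairs an intensity-consistency term with a structural term that compares pairwise response-difference matrices of CAMs from two layers. All function names, the choice of absolute pairwise differences as the "structure", and the loss weighting `lam` are illustrative assumptions.

```python
import numpy as np

def multi_subspace_projections(tokens, proj_mats):
    """Project shared token features into multiple subspaces.

    tokens:    (N, D) token features from a shared backbone.
    proj_mats: list of (D, d) projection matrices, one per subspace.
    Returns a list of (N, d) L2-normalized embeddings, one per subspace,
    suitable as inputs to a token-level contrastive loss.
    """
    outs = []
    for W in proj_mats:
        z = tokens @ W
        z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
        outs.append(z)
    return outs

def cam_alignment_loss(cam_a, cam_b, lam=1.0):
    """Toy cross-layer CAM alignment: intensity + structural consistency.

    cam_a, cam_b: flattened (H*W,) CAMs from two Transformer layers.
    The intensity term matches normalized responses pointwise; the
    structural term matches pairwise response-difference matrices,
    a simple stand-in for spatial structural relationships.
    """
    a = cam_a / (cam_a.sum() + 1e-8)
    b = cam_b / (cam_b.sum() + 1e-8)
    intensity = np.mean((a - b) ** 2)
    # Pairwise |response_i - response_j| matrices encode relative structure.
    Sa = np.abs(a[:, None] - a[None, :])
    Sb = np.abs(b[:, None] - b[None, :])
    structure = np.mean((Sa - Sb) ** 2)
    return intensity + lam * structure
```

In this toy form, identical CAMs from two layers yield zero loss, while layers whose activation patterns disagree are penalized on both absolute response and relative spatial structure.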