Multi-Scale Feature Fusion and Global Context Modeling for Fine-Grained Remote Sensing Image Segmentation


Abstract

High-precision semantic segmentation of remote sensing images plays a crucial role in Earth science analysis and urban management, especially in urban scenes with rich detail and complex structures. In such cases, jointly modeling global and local context is a key challenge for improving segmentation accuracy. Existing methods that rely on a single feature extraction architecture, such as Convolutional Neural Networks (CNNs) or Vision Transformers, are prone to semantic fragmentation because of their limited feature representation capacity. To address this issue, we propose PLGTransformer, a hybrid architecture based on dual-encoder collaborative enhancement that integrates pyramid pooling and Graph Convolutional Network (GCN) modules. The model constructs a parallel encoding architecture that combines a Swin Transformer and a CNN: the CNN branch captures fine-grained features such as road and building edges through multi-scale heterogeneous convolutions, while the Swin Transformer branch models global dependencies of large-scale land cover using hierarchical window attention. To further strengthen multi-granularity feature fusion, we design a dual-path pyramid pooling module that performs adaptive multi-scale context aggregation for both feature types and dynamically balances local and global contributions with learnable weights. In addition, we introduce GCNs to build a topological graph in the feature space, enabling geometric relationship reasoning over cross-modal feature nodes at high resolution. Experiments on the Potsdam and Vaihingen datasets show that our model outperforms contemporary advanced methods and significantly improves segmentation accuracy for small objects such as vehicles and individual buildings, validating the effectiveness of the multi-feature collaborative enhancement mechanism.
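
To illustrate the dual-branch fusion idea described in the abstract, the following is a minimal PyTorch-style sketch: pyramid pooling is applied to each branch, the two branches are blended with a learnable weight, and one GCN step reasons over spatial feature nodes. Module names (`PyramidPooling`, `GraphReasoning`, `DualBranchFusion`), channel sizes, and the similarity-based adjacency are our own assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch only; all module names, channel sizes, and the
# similarity-based adjacency below are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPooling(nn.Module):
    """Aggregate context at several scales and project back to `channels`."""

    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = channels // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),
                nn.Conv2d(channels, reduced, kernel_size=1),
            )
            for b in bins
        )
        self.project = nn.Conv2d(channels + reduced * len(bins), channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return self.project(torch.cat([x, *pooled], dim=1))


class GraphReasoning(nn.Module):
    """One GCN step over spatial positions, using a similarity-based adjacency."""

    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Linear(channels, channels)

    def forward(self, x):
        b, c, h, w = x.shape
        nodes = x.flatten(2).transpose(1, 2)                # (B, N, C), N = H*W
        adj = torch.softmax(nodes @ nodes.transpose(1, 2) / c**0.5, dim=-1)
        out = F.relu(self.weight(adj @ nodes))              # A_hat X W
        return out.transpose(1, 2).reshape(b, c, h, w) + x  # residual connection


class DualBranchFusion(nn.Module):
    """Fuse CNN (local) and Transformer (global) features with a learnable balance."""

    def __init__(self, channels):
        super().__init__()
        self.ppm_local = PyramidPooling(channels)
        self.ppm_global = PyramidPooling(channels)
        self.alpha = nn.Parameter(torch.tensor(0.5))        # local-global weight
        self.gcn = GraphReasoning(channels)

    def forward(self, cnn_feat, swin_feat):
        fused = self.alpha * self.ppm_local(cnn_feat) + (1 - self.alpha) * self.ppm_global(swin_feat)
        return self.gcn(fused)


if __name__ == "__main__":
    fusion = DualBranchFusion(channels=64)
    cnn_feat = torch.randn(1, 64, 32, 32)                   # fine-grained branch
    swin_feat = torch.randn(1, 64, 32, 32)                   # global-context branch
    print(fusion(cnn_feat, swin_feat).shape)                 # torch.Size([1, 64, 32, 32])
```

In this sketch the learnable scalar `alpha` stands in for the paper's dynamic local-global weighting, and the dense similarity adjacency stands in for the topological graph built in feature space; in practice the graph would operate on downsampled or cross-modal node sets rather than every pixel.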
