Two-Stage Cascaded Vision Transformer with Spatial Attention for Dense Settlement Detection in Remote Sensing Imagery

Abstract

Remote sensing is indispensable for studying traditional settlements in geographically inaccessible regions; however, dense settlement detection faces two challenges: spectral/textural similarity produces redundant features that obscure macro-scale patterns, and fragmented structures impede fine-grained attribute extraction. To address these issues, this study proposes a novel two-stage cascaded network designed to capture both global and local features. In Stage I, a Vision Transformer with integrated spatial attention and learnable gating (SA-ViT) extracts global features and discerns settlement-level patterns. Stage II employs a cascaded pyramid feature aggregation network with residual convolution modules to enhance feature reuse, enabling refined extraction of individual buildings and their attributes. Validated on Qiang villages in China, the framework achieves 98.1% settlement recognition accuracy and 94.4% precision in detecting architectural attributes. By addressing the complexities of dense scenes, the framework substantially enhances remote sensing detection capability and contributes to traditional settlement studies and cultural heritage preservation.
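
The abstract does not give implementation details, so the following is only a minimal sketch of how such a two-stage cascade might be wired, assuming a PyTorch implementation. The module names (SpatialAttentionGate, SAViTStage, PyramidStage), the CBAM-style spatial attention, and all layer sizes are illustrative assumptions, not the authors' architecture.

```python
# Hedged sketch of a two-stage cascade: Stage I (ViT with gated spatial attention
# for settlement-level classification) feeding Stage II (pyramid aggregation with
# residual convolutions for building-attribute maps). All specifics are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionGate(nn.Module):
    """CBAM-style spatial attention with a learnable gate scalar (assumption)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable gating weight

    def forward(self, x):                          # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x + self.gate * (attn * x)          # gated residual modulation

class SAViTStage(nn.Module):
    """Stage I: patch embedding -> spatial attention -> Transformer encoder."""
    def __init__(self, in_ch=3, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.sa = SpatialAttentionGate()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls_head = nn.Linear(dim, 2)          # settlement / non-settlement

    def forward(self, img):                        # img: (B, 3, H, W)
        feat = self.sa(self.patch_embed(img))      # (B, dim, H/16, W/16)
        tokens = self.encoder(feat.flatten(2).transpose(1, 2))
        global_feat = tokens.mean(dim=1)           # pooled settlement-level feature
        return self.cls_head(global_feat), feat

class ResidualConvBlock(nn.Module):
    """Residual convolution module promoting feature reuse (assumption)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class PyramidStage(nn.Module):
    """Stage II: upsample-and-refine pyramid producing building-attribute logits."""
    def __init__(self, dim=256, n_attr=4):
        super().__init__()
        self.levels = nn.ModuleList([ResidualConvBlock(dim) for _ in range(3)])
        self.out = nn.Conv2d(dim, n_attr, 1)       # per-pixel attribute logits

    def forward(self, feat):
        x = feat
        for block in self.levels:
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = block(x)                           # refine after each upsampling step
        return self.out(x)

# Usage: cascade Stage I global classification into Stage II refinement.
img = torch.randn(1, 3, 256, 256)
stage1, stage2 = SAViTStage(), PyramidStage()
settlement_logits, feat_map = stage1(img)          # global settlement decision
attribute_map = stage2(feat_map)                   # (1, n_attr, 128, 128) local attributes
```

The cascade reflects the stated division of labor: Stage I suppresses redundant spectral/textural responses and identifies settlement-level patterns, while Stage II reuses its feature map to recover fine-grained, per-building attributes.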
