WP-ViT2Level: Multi-Level Wavelet-Patch Vision Transformers for Robust SAR Automatic Target Recognition
Abstract
By integrating multi-level wavelet decomposition into the tokenization stage of Vision Transformers, we introduce a frequency-aware representation framework tailored for synthetic aperture radar (SAR) automatic target recognition (ATR). The proposed Wavelet-Patch Vision Transformer++ (WP-ViT++) decomposes SAR images into multi-resolution frequency sub-bands, enabling explicit separation of global structural information and high-frequency scattering features. Through wavelet-domain denoising and sub-band token embedding, the model enhances robustness against speckle noise while preserving discriminative target characteristics. A cross-wavelet attention mechanism further enables joint modeling of spatial–frequency dependencies, improving the representation of complex SAR signatures. Unlike conventional transformer-based approaches that rely solely on spatial patches, the proposed method incorporates domain-aligned frequency priors, leading to more stable and noise-resilient feature learning. Experimental results on the MSTAR benchmark demonstrate that WP-ViT++ achieves 93.6% classification accuracy, outperforming ViT, SpectFormer-Lite, and DiffFormer-Lite by significant margins. In addition, the proposed model maintains strong performance under noise perturbations, achieving over 93% accuracy under speckle noise conditions. These results confirm that wavelet-enhanced tokenization provides an effective and scalable solution for robust SAR ATR, improving both classification accuracy and generalization without increasing architectural complexity.
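To make the tokenization idea concrete, the sketch below shows a minimal, self-contained version of multi-level wavelet-patch tokenization using an orthonormal Haar transform in pure NumPy. This is an illustrative reconstruction of the general technique described in the abstract, not the authors' implementation: the function names (`haar_dwt2`, `wavelet_patch_tokens`), the Haar basis, the number of levels, and the patch size are all assumptions for demonstration; the paper's actual wavelet family, denoising step, and embedding layers are not specified here.

```python
import numpy as np

def haar_dwt2(x):
    """One level of 2D orthonormal Haar decomposition.
    Returns (LL, LH, HL, HH) sub-bands at half resolution."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-pass: global structure
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail: high-freq scattering
    return ll, lh, hl, hh

def _patchify(band, patch):
    """Split a sub-band into non-overlapping flattened patches."""
    h, w = band.shape
    return [band[i:i + patch, j:j + patch].ravel()
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]

def wavelet_patch_tokens(img, levels=2, patch=4):
    """Decompose an image over several wavelet levels, then turn
    every sub-band into patch tokens (num_tokens, patch*patch)."""
    tokens, ll = [], img
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        for band in (lh, hl, hh):        # detail sub-bands per level
            tokens += _patchify(band, patch)
    tokens += _patchify(ll, patch)       # coarsest approximation band
    return np.stack(tokens)

img = np.random.rand(32, 32).astype(np.float64)
toks = wavelet_patch_tokens(img, levels=2, patch=4)
print(toks.shape)  # (64, 16): 48 level-1 + 12 level-2 + 4 LL tokens
```

Because the Haar transform used here is orthonormal and the patches tile each sub-band exactly, the token set preserves the image's total energy, which is the sense in which the frequency-domain tokens carry the same information as the spatial patches while separating structure from detail.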