WP-ViT2Level: Multi-Level Wavelet-Patch Vision Transformers for Robust SAR Automatic Target Recognition

Abstract

We introduce a frequency-aware representation framework for synthetic aperture radar (SAR) automatic target recognition (ATR) that integrates multi-level wavelet decomposition into the tokenization stage of Vision Transformers. The proposed Wavelet-Patch Vision Transformer++ (WP-ViT++) decomposes SAR images into multi-resolution frequency sub-bands, explicitly separating global structural information from high-frequency scattering features. Through wavelet-domain denoising and sub-band token embedding, the model gains robustness against speckle noise while preserving discriminative target characteristics. A cross-wavelet attention mechanism further enables joint modeling of spatial–frequency dependencies, improving the representation of complex SAR signatures. Unlike conventional transformer-based approaches that rely solely on spatial patches, the proposed method incorporates domain-aligned frequency priors, leading to more stable and noise-resilient feature learning. Experimental results on the MSTAR benchmark show that WP-ViT++ achieves 93.6% classification accuracy, outperforming ViT, SpectFormer-Lite, and DiffFormer-Lite by significant margins. The model also remains robust under noise perturbations, retaining over 93% accuracy under speckle noise. These results confirm that wavelet-enhanced tokenization provides an effective and scalable solution for robust SAR ATR, improving both classification accuracy and generalization without increasing architectural complexity.
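To make the tokenization idea concrete, the following is a minimal NumPy sketch of multi-level wavelet-patch tokenization. It is not the authors' implementation: the abstract does not specify the wavelet family or embedding details, so this sketch assumes a Haar wavelet, a hypothetical `wavelet_patch_tokens` helper, and a random (untrained) linear projection standing in for the learned sub-band token embedding.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D Haar transform.

    Returns the approximation band LL and the detail bands (LH, HL, HH),
    each at half the input resolution.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, (lh, hl, hh)

def wavelet_patch_tokens(img, levels=2, patch=8, dim=64, rng=None):
    """Hypothetical sketch of wavelet-patch tokenization.

    Decompose `img` for `levels` Haar levels, split every sub-band into
    non-overlapping patches, and project each flattened patch to a
    `dim`-dimensional token with a random linear map (a stand-in for a
    learned embedding).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    bands, ll = [], img
    for _ in range(levels):
        ll, highs = haar_dwt2(ll)
        bands.extend(highs)          # high-frequency scattering detail
    bands.append(ll)                 # coarsest LL: global structure
    tokens = []
    for band in bands:
        h, w = band.shape
        p = min(patch, h)            # patch size must divide the band size
        pats = band.reshape(h // p, p, w // p, p).swapaxes(1, 2)
        pats = pats.reshape(-1, p * p)
        W = rng.standard_normal((p * p, dim)) / np.sqrt(p * p)
        tokens.append(pats @ W)
    return np.concatenate(tokens, axis=0)   # (num_tokens, dim)

# Example: a 64x64 SAR chip, two decomposition levels, 8x8 patches.
# Level-1 detail bands are 32x32 (16 patches each), level-2 bands and the
# final LL are 16x16 (4 patches each): 3*16 + 3*4 + 4 = 64 tokens total.
img = np.random.default_rng(1).standard_normal((64, 64))
tokens = wavelet_patch_tokens(img, levels=2, patch=8, dim=64)
print(tokens.shape)  # (64, 64)
```

The resulting token sequence would then be fed to a standard transformer encoder; the cross-wavelet attention described in the abstract would additionally let tokens from different sub-bands attend to one another.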
