Comparative Study on Vision Transformer and Convolutional Neural Networks for Solar Image Classification
Abstract
With the rapid advancement of solar observational technologies and the surge in multi-wavelength data acquisition, solar physics has entered the era of big data, posing new challenges for image analysis and classification. In this study, we present a systematic comparison between the Vision Transformer (ViT) and Convolutional Neural Networks (CNNs), focusing on their performance and underlying mechanisms in classifying solar images of the photosphere and chromosphere observed by the New Vacuum Solar Telescope (NVST). Using transfer learning with ImageNet-1k pretrained weights and data augmentation strategies, both types of model were trained and evaluated on a multi-class dataset of manually labeled solar images. Our results show that while ViT achieves classification performance comparable to that of CNNs, it exhibits greater potential in handling images with multi-structured solar features. Attention-based visualizations using Grad-CAM reveal that ViT tends to focus more broadly and in a more semantically coherent way on key solar features, such as sunspots and their penumbrae. In contrast, CNNs are more prone to focusing on dominant local features, which may limit their effectiveness in complex classification scenarios. Finally, we reveal for the first time the potential of ViT in solar image segmentation and recognition, highlighting the strong alignment of its attention maps with characteristic solar morphological features.
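The Grad-CAM visualizations mentioned in the abstract rest on a simple combination step, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `grad_cam` is hypothetical, and the feature-map activations and the gradients of the class score with respect to them are assumed to have already been extracted from a trained model as arrays of shape (K, H, W). For a ViT, this would typically require reshaping the patch-token activations into such a grid first.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations: array of shape (K, H, W), the feature maps of one layer.
    gradients:   array of shape (K, H, W), the gradient of the target class
                 score with respect to those feature maps.
    Returns an (H, W) heatmap scaled to [0, 1] (all zeros if nothing is positive).
    """
    activations = np.asarray(activations, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    if activations.ndim != 3 or activations.shape != gradients.shape:
        raise ValueError("expected two arrays of identical shape (K, H, W)")

    # Channel weights: global average of the gradients over the spatial axes.
    weights = gradients.mean(axis=(1, 2))  # shape (K,)

    # Weighted sum of the feature maps over channels, followed by ReLU so that
    # only regions with a positive influence on the class score remain.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)  # (H, W)

    # Normalize for display; leave an all-zero map unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting map is usually upsampled to the input resolution and overlaid on the solar image, which is how broad versus locally concentrated focus can be compared between models.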