CAS-FGvit: An Efficient Convolutional Additive Mixer for Fine-Grained Classification


Abstract

Fine-grained visual classification (FGVC) aims to distinguish objects that belong to subcategories of the same class, a challenging task because the inter-class differences are inherently subtle. Most existing work addresses this problem by reusing the backbone network to extract features from detected discriminative regions. However, this strategy inevitably complicates the pipeline and forces the proposed regions to contain large parts of the object, so it fails to locate the truly important parts. Recently, the Convolutional Additive Visual Transformer (CAS-ViT) has shown outstanding performance in traditional classification tasks. In this work, we first evaluate the effectiveness of the CAS-ViT framework in the fine-grained recognition setting. We then propose a part selection module that integrates all the original attention weights of the Transformer into a single attention map, which guides the network to select discriminative image patches effectively and accurately and to compute their relations. We further propose an efficient multi-scale attention module to reduce the computational workload. We name the enhanced model CAS-FGvit and demonstrate its value through experiments on the Lycium barbarum dataset and three popular fine-grained benchmarks, where it achieves state-of-the-art performance. We also present qualitative results to give a better understanding of the model.
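
The abstract does not specify how the part selection module combines the attention weights. As a rough illustration only, the sketch below assumes one square, row-normalized attention matrix per layer with a class token at index 0, multiplies the matrices across layers, and keeps the k patches the class token attends to most. The function names (`aggregate_attention`, `select_patches`) and the class-token layout are hypothetical and are not taken from the article.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def aggregate_attention(layers):
    """Multiply per-layer attention maps in order (first layer first).

    Assumption: each entry of `layers` is an n x n matrix whose rows sum to 1.
    """
    n = len(layers[0])
    joint = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        joint = matmul(attn, joint)
    return joint


def select_patches(layers, k):
    """Return the token indices of the k patches with the highest aggregated
    attention from the class token (assumed to be token 0)."""
    joint = aggregate_attention(layers)
    cls_to_patches = joint[0][1:]  # drop the class token's attention to itself
    ranked = sorted(range(len(cls_to_patches)),
                    key=lambda i: cls_to_patches[i], reverse=True)
    return [i + 1 for i in ranked[:k]]  # shift back to token indices


if __name__ == "__main__":
    # Toy input: one layer, a class token plus three patches.
    attn = [[0.1, 0.2, 0.6, 0.1],
            [0.25, 0.25, 0.25, 0.25],
            [0.25, 0.25, 0.25, 0.25],
            [0.25, 0.25, 0.25, 0.25]]
    print(select_patches([attn], k=2))
```

Multiplying the maps lets attention from early layers propagate to the final layer, so the ranking reflects information flow through the whole network rather than the last layer alone. The actual module may aggregate the weights differently.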
