Enhancing ConvNeXt for efficient small-size image classification

Abstract

Vision Transformers (ViTs) have achieved remarkable success in computer vision, particularly with the advent of the Swin Transformer (Swin-T). Recently, ConvNeXt was proposed to revisit the architecture of convolutional neural networks by incorporating techniques from Swin-T, achieving competitive performance. However, ConvNeXt exhibits relatively low accuracy and efficiency on small-size datasets with low-resolution images. To address this issue, we build on the ConvNeXt structure and employ small kernel sizes to better capture features in low-resolution images. A symmetrical inverted bottleneck structure inspired by MobileNet refines the traditional residual block design. We combine batch normalization and layer normalization to strengthen the relationships between different samples. A twice-patch embedding scheme dynamically generates adaptive patch sizes for different input image sizes. We then integrate global response normalization (GRN) to amplify the contrast and specificity across channels so that the network learns diverse features. Finally, we introduce a novel spatial coordinate attention mechanism that effectively captures global feature information. The proposed method achieves superior performance with fewer parameters on the small-size datasets CIFAR-10, CIFAR-100, Tiny-ImageNet, and Fashion-MNIST, confirming the effectiveness and efficiency of our approach.
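The abstract gives no implementation details, but global response normalization follows a published formulation (introduced with ConvNeXt V2): an L2 aggregation over the spatial dimensions, divisive normalization across channels, and a learnable affine term with an identity shortcut. Below is a minimal PyTorch sketch of that standard GRN layer for channels-last tensors; the module name and epsilon value are our assumptions, not taken from this paper.

```python
import torch
from torch import nn


class GRN(nn.Module):
    """Global response normalization over a channels-last tensor (N, H, W, C)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Aggregate each channel's response with an L2 norm over spatial dims.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)     # (N, 1, 1, C)
        # Divisive normalization: contrast each channel against the mean response.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)  # (N, 1, 1, C)
        # Learnable scale and shift plus an identity shortcut.
        return self.gamma * (x * nx) + self.beta + x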
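The spatial coordinate attention mechanism is described only at a high level and is presented as novel, so a faithful implementation cannot be reconstructed from the abstract alone. As a point of reference, the sketch below implements the coordinate-attention idea it appears to build on (factorized pooling along the height and width axes, as in Hou et al., 2021); the class name, reduction ratio, and activation choice are illustrative assumptions rather than the authors' design.

```python
import torch
from torch import nn


class CoordinateAttention(nn.Module):
    """Reference coordinate attention: gates features along H and W separately."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                          # (N, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (N, C, W, 1)
        # Encode both directions with one shared 1x1 convolution.
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Direction-specific gates, broadcast back over the full feature map.
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w
```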
