Enhancing ConvNeXt for efficient small-size image classification
Abstract
Vision Transformers (ViTs) have achieved remarkable success in computer vision, particularly with the advent of the Swin Transformer (Swin-T). ConvNeXt was recently proposed to modernize convolutional neural network design by incorporating techniques from Swin-T, and it achieves competitive performance. However, ConvNeXt exhibits lower accuracy and efficiency on small-size datasets of low-resolution images. To address this issue, we build on the ConvNeXt structure and employ smaller kernel sizes to better capture features in low-resolution images. A symmetrical inverted bottleneck structure inspired by MobileNet refines the traditional residual block design. We combine batch normalization with layer normalization to strengthen the relationships between different samples, and we use a twice-patch embedding scheme that dynamically generates adaptive patch sizes for different input image sizes. We further integrate global response normalization to amplify the contrast and specificity of individual channels, encouraging the network to learn more diverse features. Finally, we introduce a novel spatial coordinate attention mechanism that effectively captures global feature information. The proposed method demonstrates superior performance on the small-size datasets CIFAR-10, CIFAR-100, Tiny-ImageNet, and Fashion-MNIST with fewer parameters, confirming the effectiveness and efficiency of our approach.
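To make two of the abstract's ingredients concrete, below is a minimal PyTorch sketch of a ConvNeXt-style block that pairs a small 3x3 depthwise kernel with global response normalization; the GRN formulation follows the published ConvNeXt V2 definition. The class names SmallKernelBlock and GRN, the 3x3 kernel choice, and the expansion ratio of 4 are illustrative assumptions, not the paper's exact design; the symmetrical inverted bottleneck, combined BN/LN, twice-patch embedding, and spatial coordinate attention described above are detailed in the body of the article and not reproduced here.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global response normalization (ConvNeXt V2 formulation):
    boosts channel contrast by normalizing each channel's global
    L2 response against the mean response across channels."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):  # x: (N, H, W, C), channels-last
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)  # per-channel global L2 norm
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)   # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x        # learnable affine + residual

class SmallKernelBlock(nn.Module):
    """Illustrative ConvNeXt-style block with a 3x3 depthwise kernel
    (instead of the original 7x7) for low-resolution inputs, followed
    by an inverted-bottleneck MLP with GRN. Hypothetical sketch, not
    the paper's exact block."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # expand
        self.act = nn.GELU()
        self.grn = GRN(expansion * dim)
        self.pwconv2 = nn.Linear(expansion * dim, dim)  # project back

    def forward(self, x):  # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)   # to (N, H, W, C) for LayerNorm/Linear
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.grn(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)   # back to (N, C, H, W)
        return shortcut + x
```

On a 32x32 input such as CIFAR-10, a 7x7 depthwise kernel covers nearly a quarter of the feature map at deeper stages; the intuition behind the smaller kernel is to keep the receptive field proportionate to the low resolution, which is consistent with the abstract's motivation.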