Enhancing ConvNeXt for efficient small-size image classification
Abstract
Vision Transformers (ViTs) have achieved remarkable success in computer vision, particularly with the advent of the Swin Transformer (Swin-T). ConvNeXt was recently proposed to modernize convolutional neural network design by incorporating techniques from Swin-T, and it achieves competitive performance. However, ConvNeXt exhibits lower accuracy and efficiency on small-size datasets of low-resolution images. To address this issue, we build on the ConvNeXt structure and employ smaller kernel sizes to better capture features in low-resolution images. A symmetrical inverted bottleneck structure inspired by MobileNet refines the traditional residual block design. We combine batch normalization with layer normalization to strengthen the relationships between different samples, and we use a twice-patch embedding scheme that dynamically generates adaptive patch sizes for different input image sizes. We further integrate global response normalization to amplify the contrast and specificity of individual channels, encouraging the network to learn more diverse features. Finally, we introduce a novel spatial coordinate attention mechanism that effectively captures global feature information. The proposed method demonstrates superior performance on the small-size datasets CIFAR-10, CIFAR-100, Tiny-ImageNet, and Fashion-MNIST with fewer parameters, confirming the effectiveness and efficiency of our approach.
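To make two of the abstract's ingredients concrete, below is a minimal PyTorch sketch of a ConvNeXt-style block that pairs a small 3x3 depthwise kernel with global response normalization; the GRN formulation follows the published ConvNeXt V2 definition. The class names SmallKernelBlock and GRN, the 3x3 kernel choice, and the expansion ratio of 4 are illustrative assumptions, not the paper's exact design; the symmetrical inverted bottleneck, combined BN/LN, twice-patch embedding, and spatial coordinate attention described above are detailed in the body of the article and not reproduced here.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global response normalization (ConvNeXt V2 formulation):
    boosts channel contrast by normalizing each channel's global
    L2 response against the mean response across channels."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):  # x: (N, H, W, C), channels-last
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)  # per-channel global L2 norm
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)   # divisive normalization across channels
        return self.gamma * (x * nx) + self.beta + x        # learnable affine + residual

class SmallKernelBlock(nn.Module):
    """Illustrative ConvNeXt-style block with a 3x3 depthwise kernel
    (instead of the original 7x7) for low-resolution inputs, followed
    by an inverted-bottleneck MLP with GRN. Hypothetical sketch, not
    the paper's exact block."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # expand
        self.act = nn.GELU()
        self.grn = GRN(expansion * dim)
        self.pwconv2 = nn.Linear(expansion * dim, dim)  # project back

    def forward(self, x):  # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)   # to (N, H, W, C) for LayerNorm/Linear
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.grn(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)   # back to (N, C, H, W)
        return shortcut + x
```

On a 32x32 input such as CIFAR-10, a 7x7 depthwise kernel covers nearly a quarter of the feature map at deeper stages; the intuition behind the smaller kernel is to keep the receptive field proportionate to the low resolution, which is consistent with the abstract's motivation.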