An Empirical Investigation into Fine-Tuning Methodologies: A Comparative Benchmark of LoRA for Vision Transformers

Abstract

In modern computer vision, the standard practice for solving a new problem is not to train a model from scratch but to adapt a large, pre-trained model. This process, known as fine-tuning, is a cornerstone of deep learning. This paper investigates two fundamentally different philosophies for fine-tuning. The first is the traditional, widely used approach in which the core of a pre-trained Convolutional Neural Network (CNN) is kept frozen and only a small, new classification layer is trained; this method is fast and computationally cheap. The second is a more recent parameter-efficient fine-tuning (PEFT) technique called Low-Rank Adaptation (LoRA), which allows a deeper adaptation of a model’s internal representations without the immense cost of full retraining. To compare these methods, we designed a benchmark using three powerful pre-trained models. For the traditional approach, we used ResNet50 and EfficientNet-B0, two highly influential CNNs. For the modern approach, we used a Vision Transformer (ViT), an architecture that processes images with a self-attention mechanism, and adapted it with LoRA. We then evaluated these models on three datasets of increasing complexity: the simple MNIST (handwritten digits), the moderately complex Fashion-MNIST (clothing items), and the significantly more challenging CIFAR-10 (color photographs of objects such as cars, dogs, and ships). This ladder of complexity was designed to reveal the conditions under which each fine-tuning strategy excels or fails. On the challenging CIFAR-10 dataset, the results were striking: the ViT-LoRA model achieved 97.3% validation accuracy, while ResNet50 and EfficientNet-B0 reached only 83.4% and 80.5%, respectively. On the much easier MNIST dataset, by contrast, the task was not difficult enough to separate the models, and all of them scored nearly perfectly at roughly 99%. Critically, the ViT-LoRA model’s superior performance was achieved with remarkable efficiency: LoRA required training only 0.58% of the total model parameters, a small fraction of the 8-11% of parameters that had to be trained under the traditional CNN approach. This leads to the central conclusion of our work: LoRA is not just a more efficient method but a more effective one. For complex, real-world tasks, the ability to adapt a model’s internal representations, as LoRA does for the ViT’s attention layers, provides a decisive performance advantage that rigid, classifier-only fine-tuning cannot match.
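
The following is a minimal PyTorch sketch, not the paper’s released code, illustrating the two strategies described above: freezing a pre-trained ResNet50 and training only a new classification head, versus adding a trainable low-rank update to a frozen linear projection in the style of LoRA. The model weights, the rank r = 8, the scaling factor, and the 768-dimensional projection are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# --- Strategy 1: classifier-only fine-tuning of a frozen CNN (illustrative) ---
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in resnet.parameters():
    p.requires_grad = False                       # freeze the entire backbone
resnet.fc = nn.Linear(resnet.fc.in_features, 10)  # the new 10-class head is the only trainable part


# --- Strategy 2: a minimal LoRA-style wrapper around a frozen linear layer ---
class LoRALinear(nn.Module):
    """Adds a trainable low-rank update to a frozen nn.Linear: y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # the original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


# In a ViT, such wrappers would typically replace the query/value projections inside
# each self-attention block; here the mechanism is shown on a single hypothetical layer.
proj = nn.Linear(768, 768)
lora_proj = LoRALinear(proj, r=8, alpha=16)

trainable = sum(p.numel() for p in lora_proj.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora_proj.parameters())
print(f"trainable fraction of this layer: {trainable / total:.4%}")  # low-rank factors are a small fraction
```

In practice, a library such as Hugging Face peft can apply equivalent low-rank adapters to a Vision Transformer’s attention projections automatically; the sketch above only illustrates the underlying mechanism and why the trainable parameter count stays so small.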
