Optimizing Multi-GPU Training with Data Parallelism and Batch Size Selection
Abstract
This study investigated the optimization of deep learning model training in multi-GPU environments, focusing on the impact of batch size on computational efficiency and model performance. The authors trained the MobileNetV2 architecture on a large-scale image dataset under both single-GPU and multi-GPU configurations, finding that the multi-GPU setup significantly reduced training time and eased memory constraints. Batch sizes of 16, 32, 64, and 128 were analyzed to determine their influence on validation accuracy and convergence rate. The study established that a batch size of 64 provided the best balance between training efficiency and model generalization. The research highlights the benefits of data parallelism and multi-GPU systems while addressing the trade-offs between computational speed and accuracy. Suggestions for future work include developing adaptive batch size techniques and extending the analysis to other architectures and datasets.
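To make the experimental setup concrete, the sketch below shows one way to train MobileNetV2 with data parallelism across multiple GPUs and a configurable batch size. It is a minimal illustration only: the abstract does not specify the framework, dataset path, or hyperparameters, so PyTorch, the `data/train` ImageFolder location, and the learning-rate/momentum values are all assumptions rather than the authors' actual configuration.

```python
# Minimal sketch: data-parallel training of MobileNetV2 with a configurable
# batch size. PyTorch/torchvision, the dataset path, and the optimizer
# settings are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

BATCH_SIZE = 64  # the paper's best-performing setting; 16, 32, and 128 were also tested

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = datasets.ImageFolder("data/train", transform=transform)  # hypothetical path
    train_loader = DataLoader(train_set, batch_size=BATCH_SIZE,
                              shuffle=True, num_workers=4, pin_memory=True)

    model = models.mobilenet_v2(num_classes=len(train_set.classes))
    # Data parallelism: replicate the model on every visible GPU and split
    # each batch across the replicas, averaging gradients automatically.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    main()
```

Because the effective per-GPU batch shrinks as GPUs are added, sweeping `BATCH_SIZE` over 16, 32, 64, and 128 as in the study is a simple way to observe the trade-off between throughput and validation accuracy on one's own hardware.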