A Comprehensive Survey on Distributed Deep Learning Training: Parallelism Strategies, Frameworks, and Network Interconnects

Abstract

The rapid growth of large language models (LLMs) and deep neural networks has necessitated sophisticated distributed training techniques. Models with hundreds of billions to trillions of parameters, such as GPT-4, cannot be trained on a single GPU, making distributed training across multiple GPUs and nodes essential. This survey provides a comprehensive overview of distributed deep learning training technologies, covering four key dimensions: (1) parallelism strategies, including data parallelism, tensor parallelism, pipeline parallelism, and their combinations; (2) training frameworks such as DeepSpeed, Megatron-LM, GPipe, and PyTorch FSDP; (3) communication optimization techniques, including collective operations, gradient compression, and computation-communication overlap; and (4) network interconnect technologies, including NVLink, NVSwitch, InfiniBand, and RDMA over Converged Ethernet (RoCE). We analyze the trade-offs among memory efficiency, computational efficiency, and communication overhead for each approach. Furthermore, we discuss practical deployment considerations for single-node multi-GPU and multi-node multi-GPU configurations. Finally, we identify open challenges and future research directions in this rapidly evolving field.
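
As a concrete illustration of the simplest of these strategies, data parallelism, the following minimal sketch (not taken from the survey or any of the frameworks it covers) uses PyTorch DistributedDataParallel: each process holds a full model replica, computes gradients on its own shard of the batch, and gradients are averaged across processes with an NCCL all-reduce during the backward pass. The toy model, batch size, and launch setup (torchrun populating RANK, LOCAL_RANK, and WORLD_SIZE) are illustrative assumptions only.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets MASTER_ADDR/PORT, RANK, LOCAL_RANK, and WORLD_SIZE,
    # so the default env:// initialization works out of the box.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank keeps a full replica of the (toy) model.
    model = nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank would normally load a distinct shard of the dataset;
        # random data stands in for that here.
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        loss.backward()  # DDP overlaps the gradient all-reduce with backward
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Frameworks such as DeepSpeed ZeRO and PyTorch FSDP extend this pattern by sharding optimizer state, gradients, and parameters across ranks to reduce per-GPU memory, while tensor and pipeline parallelism partition the model itself across devices.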
