A Comprehensive Survey on Distributed Deep Learning Training: Parallelism Strategies, Frameworks, and Network Interconnects

Abstract

The rapid growth of large language models (LLMs) and deep neural networks has necessitated sophisticated distributed training techniques. Models with hundreds of billions to trillions of parameters, such as GPT-4, cannot be trained on a single GPU, making distributed training across multiple GPUs and nodes essential. This survey provides a comprehensive overview of distributed deep learning training technologies, covering four key dimensions: (1) parallelism strategies, including data parallelism, tensor parallelism, pipeline parallelism, and their combinations; (2) training frameworks such as DeepSpeed, Megatron-LM, GPipe, and PyTorch FSDP; (3) communication optimization techniques, including collective operations, gradient compression, and computation-communication overlap; and (4) network interconnect technologies, including NVLink, NVSwitch, InfiniBand, and RDMA over Converged Ethernet (RoCE). We analyze the trade-offs among memory efficiency, computational efficiency, and communication overhead for each approach. Furthermore, we discuss practical deployment considerations for single-node multi-GPU and multi-node multi-GPU configurations. Finally, we identify open challenges and future research directions in this rapidly evolving field.
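
As a concrete illustration of the simplest of these strategies, data parallelism, the following minimal sketch (not taken from the survey or any of the frameworks it covers) uses PyTorch DistributedDataParallel: each process holds a full model replica, computes gradients on its own shard of the batch, and gradients are averaged across processes with an NCCL all-reduce during the backward pass. The toy model, batch size, and launch setup (torchrun populating RANK, LOCAL_RANK, and WORLD_SIZE) are illustrative assumptions only.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets MASTER_ADDR/PORT, RANK, LOCAL_RANK, and WORLD_SIZE,
    # so the default env:// initialization works out of the box.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank keeps a full replica of the (toy) model.
    model = nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank would normally load a distinct shard of the dataset;
        # random data stands in for that here.
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        loss.backward()  # DDP overlaps the gradient all-reduce with backward
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Frameworks such as DeepSpeed ZeRO and PyTorch FSDP extend this pattern by sharding optimizer state, gradients, and parameters across ranks to reduce per-GPU memory, while tensor and pipeline parallelism partition the model itself across devices.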
