Beyond All-Reduce: Event-Driven Model Parallelism Without Collective Communication Primitives (EBD2N)
Abstract
The rapid growth of Deep Neural Network (DNN) parameter counts has rendered single-device training infeasible, necessitating complex parallelization strategies. Existing approaches rely on blocking collective communication primitives (e.g., All-Reduce) that introduce synchronization bottlenecks. We introduce the Event-Based Deep Neural Network (EBD2N), a distributed architecture that replaces global collective synchronization with asynchronous point-to-point event messaging. EBD2N partitions layers both vertically (across features) and horizontally (across weights), enabling fine-grained distribution through localized gradient accumulation at each partition, without parameter servers or global synchronization barriers. We formalize the architecture mathematically and prove its equivalence to the standard DNN formulation. Empirical evaluation on a 4-GPU NVIDIA H200 cluster shows that EBD2N achieves up to a 1.67$\times$ throughput improvement over single-GPU baselines on high-dimensional input tasks, structurally surpassing pipeline parallelism in input-dominated scenarios. EBD2N thus offers a scalable alternative for training massive-scale models on hybrid infrastructure.
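To make the core idea concrete, the following is a minimal sketch (an illustration under our own assumptions, not the authors' implementation) of replacing a blocking all-reduce with asynchronous point-to-point event messages. Each worker owns a vertical (feature) shard of a layer, computes a partial result for an input event, and posts it directly to the downstream partition's inbox; the receiver accumulates partials locally and fires only when all shards for that event have arrived, with no global barrier. The names (`shard_worker`, `downstream_partition`, `inbox`) are hypothetical.

```python
# Sketch of point-to-point event messaging with local accumulation
# (assumption: simplified stand-in for EBD2N's messaging, not the paper's code).
import queue
import threading
import numpy as np

N_SHARDS = 2
inbox = queue.Queue()  # downstream partition's event inbox (point-to-point channel)

def shard_worker(shard_id, w_shard, x_shard, event_id):
    """Compute a local partial activation and emit it as an event message."""
    partial = w_shard @ x_shard               # local compute on this feature shard
    inbox.put((event_id, shard_id, partial))  # asynchronous point-to-point send

def downstream_partition(expected_events):
    """Accumulate partials locally; complete an event when all shards arrive."""
    acc = {}      # event_id -> (running sum, shard count)
    results = {}
    done = 0
    while done < expected_events:
        event_id, _, partial = inbox.get()    # messages may arrive in any order
        s, c = acc.pop(event_id, (0.0, 0))
        s, c = s + partial, c + 1
        if c == N_SHARDS:                     # all partials for this event received
            results[event_id] = s
            done += 1
        else:
            acc[event_id] = (s, c)
    return results

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
x = rng.standard_normal(6)

threads = []
for sid in range(N_SHARDS):
    cols = slice(sid * 3, (sid + 1) * 3)      # vertical (feature) split of the layer
    t = threading.Thread(target=shard_worker,
                         args=(sid, W[:, cols], x[cols], 0))
    t.start()
    threads.append(t)

out = downstream_partition(expected_events=1)
for t in threads:
    t.join()

# Local accumulation of the shard partials reproduces the unpartitioned layer,
# mirroring the paper's equivalence claim for the full architecture.
assert np.allclose(out[0], W @ x)
```

The design point this illustrates: because each message targets a single downstream partition and accumulation is local, no worker waits at a collective barrier; the receiver makes progress as soon as any partial arrives.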