Beyond All-Reduce: Event-Driven Model Parallelism Without Collective Communication Primitives (EBD2N)
Abstract
The rapid growth of Deep Neural Network (DNN) parameter counts has rendered single-device training infeasible, necessitating complex parallelization strategies. Existing approaches rely on blocking collective communication primitives (e.g., All-Reduce) that introduce synchronization bottlenecks. We introduce the Event-Based Deep Neural Network (EBD2N), a distributed architecture that replaces global collective synchronization with asynchronous point-to-point event messaging. EBD2N partitions layers both vertically (across features) and horizontally (across weights), enabling fine-grained distribution through localized gradient accumulation at each partition, without parameter servers or global synchronization barriers. We formalize the architecture mathematically and prove its equivalence to the standard DNN formulation. Empirical evaluation on a 4-GPU NVIDIA H200 cluster shows that EBD2N achieves up to a 1.67$\times$ throughput improvement over single-GPU baselines on high-dimensional input tasks, structurally surpassing pipeline parallelism in input-dominated scenarios. EBD2N thus offers a scalable alternative for training massive-scale models on hybrid infrastructure.
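To make the core idea concrete, the following is a minimal sketch (an illustration under our own assumptions, not the authors' implementation) of replacing a blocking all-reduce with asynchronous point-to-point event messages. Each worker owns a vertical (feature) shard of a layer, computes a partial result for an input event, and posts it directly to the downstream partition's inbox; the receiver accumulates partials locally and fires only when all shards for that event have arrived, with no global barrier. The names (`shard_worker`, `downstream_partition`, `inbox`) are hypothetical.

```python
# Sketch of point-to-point event messaging with local accumulation
# (assumption: simplified stand-in for EBD2N's messaging, not the paper's code).
import queue
import threading
import numpy as np

N_SHARDS = 2
inbox = queue.Queue()  # downstream partition's event inbox (point-to-point channel)

def shard_worker(shard_id, w_shard, x_shard, event_id):
    """Compute a local partial activation and emit it as an event message."""
    partial = w_shard @ x_shard               # local compute on this feature shard
    inbox.put((event_id, shard_id, partial))  # asynchronous point-to-point send

def downstream_partition(expected_events):
    """Accumulate partials locally; complete an event when all shards arrive."""
    acc = {}      # event_id -> (running sum, shard count)
    results = {}
    done = 0
    while done < expected_events:
        event_id, _, partial = inbox.get()    # messages may arrive in any order
        s, c = acc.pop(event_id, (0.0, 0))
        s, c = s + partial, c + 1
        if c == N_SHARDS:                     # all partials for this event received
            results[event_id] = s
            done += 1
        else:
            acc[event_id] = (s, c)
    return results

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
x = rng.standard_normal(6)

threads = []
for sid in range(N_SHARDS):
    cols = slice(sid * 3, (sid + 1) * 3)      # vertical (feature) split of the layer
    t = threading.Thread(target=shard_worker,
                         args=(sid, W[:, cols], x[cols], 0))
    t.start()
    threads.append(t)

out = downstream_partition(expected_events=1)
for t in threads:
    t.join()

# Local accumulation of the shard partials reproduces the unpartitioned layer,
# mirroring the paper's equivalence claim for the full architecture.
assert np.allclose(out[0], W @ x)
```

The design point this illustrates: because each message targets a single downstream partition and accumulation is local, no worker waits at a collective barrier; the receiver makes progress as soon as any partial arrives.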