RackWeave: Hierarchical Gradient Exchange for Distributed AI
Abstract
This paper reimagines model-update flow in data-parallel training as a balanced I/O service co-designed across NICs, memory hierarchies, and CPUs, addressing the communication bottlenecks that arise when accelerators outpace network bandwidth in distributed AI systems. The architecture slices model state into fine-grained chunks, drives per-core aggregation and optimization pipelines with NUMA-aware buffers, and employs zero-copy RDMA over multiple high-speed interfaces to maximize overlap between transport and computation without cross-core contention. By anchoring a gradient-exchange node at the top of the rack and composing it with hierarchical cross-rack coordination, the design confines most traffic within the rack and minimizes traversal of the oversubscribed network core during synchronization. The implementation interoperates with mainstream training stacks while restoring compute-bound behavior through communication-aware chunk mapping, streaming aggregation, and streamlined update paths. Experiments on representative vision workloads under cloud-like network conditions demonstrate consistent throughput and cost-efficiency gains over sharded baselines while preserving accuracy, with scalability bounded by memory/PCIe fabric limits rather than GPU compute. Together, these mechanisms provide a practical template for rack-centric distributed AI training in which gradient exchange is treated as a first-class, balanced rack resource rather than a colocated afterthought.
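To make the hierarchical, chunked aggregation idea concrete, the following is a minimal sketch (not the paper's implementation) of the two-level reduction pattern the abstract describes: gradients are split into fixed-size chunks, summed first within each rack, and only the per-rack partial sums cross the rack boundary. The names `Rack`-level helpers, `hierarchical_allreduce`, and `CHUNK_ELEMS` are illustrative assumptions; real deployments would replace the in-process loops with NUMA-pinned cores and zero-copy RDMA transfers.

```python
# Illustrative sketch of chunked, rack-local streaming aggregation followed
# by a small cross-rack combine step. Assumed names: CHUNK_ELEMS,
# intra_rack_reduce, hierarchical_allreduce.
import numpy as np

CHUNK_ELEMS = 4  # toy chunk size; a real system tunes this to NIC/NUMA limits


def chunked(grad: np.ndarray, size: int):
    """Yield fixed-size views of a flat gradient vector."""
    for start in range(0, grad.size, size):
        yield start, grad[start:start + size]


def intra_rack_reduce(worker_grads: list[np.ndarray]) -> np.ndarray:
    """Streaming sum of all workers in one rack, processed chunk by chunk."""
    acc = np.zeros_like(worker_grads[0])
    for grad in worker_grads:
        for start, chunk in chunked(grad, CHUNK_ELEMS):
            acc[start:start + chunk.size] += chunk
    return acc


def hierarchical_allreduce(racks: list[list[np.ndarray]]) -> np.ndarray:
    """Reduce within each rack, then combine the per-rack partial sums."""
    rack_sums = [intra_rack_reduce(workers) for workers in racks]
    total = np.sum(rack_sums, axis=0)          # cross-rack stage: small fan-in
    return total / sum(len(w) for w in racks)  # mean gradient over all workers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two racks with three workers each, 12-element toy gradients.
    racks = [[rng.standard_normal(12) for _ in range(3)] for _ in range(2)]
    avg = hierarchical_allreduce(racks)
    flat = np.mean([g for workers in racks for g in workers], axis=0)
    assert np.allclose(avg, flat)
    print("hierarchical mean matches flat mean:", np.allclose(avg, flat))
```

The check at the end confirms the two-level reduction yields the same averaged gradient as a flat all-reduce; the benefit of the hierarchy is where the traffic flows, not what it computes.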