RackWeave: Hierarchical Gradient Exchange for Distributed AI
Abstract
This paper reimagines model-update flow in data-parallel training as a balanced I/O service co-designed across NICs, memory hierarchies, and CPUs, addressing the communication bottlenecks that arise when accelerators outpace network bandwidth in distributed AI systems. The architecture slices model state into fine-grained chunks, drives per-core aggregation and optimization pipelines with NUMA-aware buffers, and employs zero-copy RDMA over multiple high-speed interfaces to maximize overlap between transport and computation without cross-core contention. By anchoring a gradient-exchange node at the top of the rack and composing it with hierarchical cross-rack coordination, the design confines most traffic within the rack and minimizes traversal of the oversubscribed network core during synchronization. The implementation interoperates with mainstream training stacks while restoring compute-bound behavior through communication-aware chunk mapping, streaming aggregation, and streamlined update paths. Experiments on representative vision workloads under cloud-like network conditions demonstrate consistent throughput and cost-efficiency gains over sharded baselines while preserving accuracy, with scalability bounded by memory/PCIe fabric limits rather than GPU compute. Together, these mechanisms provide a practical template for rack-centric distributed AI training in which gradient exchange is treated as a first-class, balanced rack resource rather than a colocated afterthought.
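To make the hierarchical, chunked aggregation idea concrete, the following is a minimal sketch (not the paper's implementation) of the two-level reduction pattern the abstract describes: gradients are split into fixed-size chunks, summed first within each rack, and only the per-rack partial sums cross the rack boundary. The names `Rack`-level helpers, `hierarchical_allreduce`, and `CHUNK_ELEMS` are illustrative assumptions; real deployments would replace the in-process loops with NUMA-pinned cores and zero-copy RDMA transfers.

```python
# Illustrative sketch of chunked, rack-local streaming aggregation followed
# by a small cross-rack combine step. Assumed names: CHUNK_ELEMS,
# intra_rack_reduce, hierarchical_allreduce.
import numpy as np

CHUNK_ELEMS = 4  # toy chunk size; a real system tunes this to NIC/NUMA limits


def chunked(grad: np.ndarray, size: int):
    """Yield fixed-size views of a flat gradient vector."""
    for start in range(0, grad.size, size):
        yield start, grad[start:start + size]


def intra_rack_reduce(worker_grads: list[np.ndarray]) -> np.ndarray:
    """Streaming sum of all workers in one rack, processed chunk by chunk."""
    acc = np.zeros_like(worker_grads[0])
    for grad in worker_grads:
        for start, chunk in chunked(grad, CHUNK_ELEMS):
            acc[start:start + chunk.size] += chunk
    return acc


def hierarchical_allreduce(racks: list[list[np.ndarray]]) -> np.ndarray:
    """Reduce within each rack, then combine the per-rack partial sums."""
    rack_sums = [intra_rack_reduce(workers) for workers in racks]
    total = np.sum(rack_sums, axis=0)          # cross-rack stage: small fan-in
    return total / sum(len(w) for w in racks)  # mean gradient over all workers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two racks with three workers each, 12-element toy gradients.
    racks = [[rng.standard_normal(12) for _ in range(3)] for _ in range(2)]
    avg = hierarchical_allreduce(racks)
    flat = np.mean([g for workers in racks for g in workers], axis=0)
    assert np.allclose(avg, flat)
    print("hierarchical mean matches flat mean:", np.allclose(avg, flat))
```

The check at the end confirms the two-level reduction yields the same averaged gradient as a flat all-reduce; the benefit of the hierarchy is where the traffic flows, not what it computes.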