LinkShard: Communication-Centric Parameter Serving for Distributed AI
Abstract
This paper presents a distributed systems design for large-scale AI training that rebalances the communication/computation pipeline by co-designing a high-throughput parameter server with a network-aware software stack. The system shards model state across multiple high-speed interfaces per NUMA domain, uses RDMA for cross-node transfers, and applies fine-grained, vectorized gradient chunking to maximize overlap between aggregation and transport. A NUMA-aware memory layout and zero-copy data paths minimize synchronization and cache contention, while a streamlined update pipeline localizes data movement to preserve end-to-end throughput under data parallelism. The architecture is evaluated on modern vision workloads over cloud-like networks, demonstrating consistent speedups over sharded baselines without degrading model quality; scaling is bounded by PCIe and memory-fabric limits rather than GPU compute. The design further outlines an in-rack aggregation path that uses programmable switching to compress cross-rack traffic, offering a pragmatic blueprint for AI training that treats the parameter server as a balanced I/O service coordinating NICs, memory, and GPUs rather than as a monolithic compute node.
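As a rough illustration of the overlap the abstract describes, the sketch below pipelines chunked gradient aggregation against transmission, so one chunk is on the wire while the next is still being reduced. It is a minimal single-process model under stated assumptions, not the paper's implementation: the names (chunk_views, aggregate_and_send, CHUNK_ELEMS) and the threaded queue transport are hypothetical stand-ins for RDMA sends from pinned, NUMA-local buffers.

```python
import queue
import threading
import numpy as np

# Chunk size is a tunable; this value is illustrative, not from the paper.
CHUNK_ELEMS = 1 << 16


def chunk_views(grad, chunk_elems=CHUNK_ELEMS):
    """Yield contiguous views over a flattened gradient buffer (zero-copy)."""
    flat = grad.reshape(-1)
    for start in range(0, flat.size, chunk_elems):
        yield flat[start:start + chunk_elems]


def aggregate(chunk, peer_chunk):
    """Vectorized in-place reduction of one chunk (stand-in for SIMD aggregation)."""
    np.add(chunk, peer_chunk, out=chunk)
    return chunk


def sender(tx_queue, transport):
    """Drain aggregated chunks and push them onto the (mock) transport."""
    while True:
        item = tx_queue.get()
        if item is None:  # sentinel: no more chunks
            break
        transport(item)


def aggregate_and_send(grad, peer_grad, transport):
    """Overlap aggregation of chunk i+1 with transmission of chunk i."""
    tx_queue = queue.Queue(maxsize=4)  # bounded: models NIC backpressure
    tx = threading.Thread(target=sender, args=(tx_queue, transport))
    tx.start()
    for mine, theirs in zip(chunk_views(grad), chunk_views(peer_grad)):
        # Hand the reduced chunk to the sender, then immediately start
        # reducing the next one while the previous is in flight.
        tx_queue.put(aggregate(mine, theirs))
    tx_queue.put(None)
    tx.join()


if __name__ == "__main__":
    g = np.ones(1 << 20, dtype=np.float32)
    p = np.full(1 << 20, 2.0, dtype=np.float32)
    sent = []
    aggregate_and_send(g, p, transport=lambda c: sent.append(c.size))
    print(f"transmitted {len(sent)} chunks, {sum(sent)} elements")
```

The bounded queue is the load-balancing property in miniature: if transmission falls behind, aggregation stalls instead of buffering without limit, keeping the two stages of the pipeline matched.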