LinkShard: Communication-Centric Parameter Serving for Distributed AI

Abstract

This paper presents a distributed systems design for large-scale AI training that rebalances the communication/computation pipeline by co-designing a high-throughput parameter server with a network-aware software stack. The system shards model state across multiple high-speed interfaces per NUMA domain, uses RDMA for cross-node transfers, and applies fine-grained, vectorized gradient chunking to maximize overlap between aggregation and transport. A NUMA-aware memory layout and zero-copy data paths minimize synchronization and cache contention, while a streamlined update pipeline localizes data movement to preserve end-to-end throughput under data parallelism. The architecture is evaluated on modern vision workloads over cloud-like networks and demonstrates consistent speedups over sharded baselines without degrading model quality; scaling is bounded by PCIe and memory-fabric limits rather than by GPU compute. The design further outlines an in-rack aggregation path that uses programmable switching to compress cross-rack traffic. Together, these elements offer a pragmatic blueprint for AI training that treats the parameter server as a balanced I/O service coordinating NICs, memory, and GPUs, rather than as a monolithic compute node.
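As a concrete illustration of the chunked-overlap idea described in the abstract, the sketch below splits per-worker gradients into fixed-size chunks and hands each aggregated chunk to a transport thread, so reduction of later chunks overlaps the sending of earlier ones. The chunk size, the aggregate/transport stand-ins, and the bounded queue are illustrative assumptions, not the paper's implementation; a real system would back the transport stage with RDMA writes.

# Minimal sketch (assumed, not the paper's code): fine-grained gradient
# chunking with aggregation overlapped against transport.
import queue
import threading

CHUNK_ELEMS = 4096  # assumed chunk granularity

def aggregate(chunk):
    # Stand-in for a vectorized local reduction: elementwise sum of the
    # per-worker gradient slices for this chunk.
    return [sum(vals) for vals in zip(*chunk)]

def transport(out_q):
    # Stand-in for the RDMA send path: drains aggregated chunks as they
    # become ready, so communication proceeds while later chunks reduce.
    while True:
        item = out_q.get()
        if item is None:  # sentinel: no more chunks
            break
        chunk_id, reduced = item
        # ... a real system would post `reduced` to the NIC here ...

def push_gradients(worker_grads):
    # worker_grads: equal-length gradient vectors, one list per worker.
    out_q = queue.Queue(maxsize=8)  # bounded to keep the pipeline balanced
    sender = threading.Thread(target=transport, args=(out_q,))
    sender.start()
    n = len(worker_grads[0])
    for cid, start in enumerate(range(0, n, CHUNK_ELEMS)):
        chunk = [g[start:start + CHUNK_ELEMS] for g in worker_grads]
        out_q.put((cid, aggregate(chunk)))  # sender overlaps the next chunk
    out_q.put(None)  # signal completion
    sender.join()

if __name__ == "__main__":
    grads = [[float(w)] * 10000 for w in range(4)]  # four toy workers
    push_gradients(grads)

The bounded queue is the key design choice in this toy pipeline: it applies backpressure so aggregation cannot run arbitrarily ahead of transport, mirroring the balance between reduction and network bandwidth that the abstract emphasizes.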
