LinkShard: Communication-Centric Parameter Serving for Distributed AI
Abstract
This paper presents a distributed systems design for large-scale AI training that rebalances the communication/computation pipeline by co-designing a high-throughput parameter server with a network-aware software stack. The system shards model state across multiple high-speed interfaces per NUMA domain, uses RDMA for cross-node transfers, and applies fine-grained, vectorized gradient chunking to maximize overlap between aggregation and transport. A NUMA-aware memory layout and zero-copy data paths minimize synchronization and cache contention, while a streamlined update pipeline localizes data movement to preserve end-to-end throughput under data parallelism. The architecture is evaluated on modern vision workloads over cloud-like networks, demonstrating consistent speedups over sharded baselines without degrading model quality; scaling is bounded by PCIe and memory-fabric limits rather than GPU compute. The design further outlines an in-rack aggregation path that uses programmable switching to compress cross-rack traffic, offering a pragmatic blueprint for AI training that treats the parameter server as a balanced I/O service coordinating NICs, memory, and GPUs rather than as a monolithic compute node.
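As a rough illustration of the overlap the abstract describes, the sketch below pipelines chunked gradient aggregation against transmission, so one chunk is on the wire while the next is still being reduced. It is a minimal single-process model under stated assumptions, not the paper's implementation: the names (chunk_views, aggregate_and_send, CHUNK_ELEMS) and the threaded queue transport are hypothetical stand-ins for RDMA sends from pinned, NUMA-local buffers.

```python
import queue
import threading
import numpy as np

# Chunk size is a tunable; this value is illustrative, not from the paper.
CHUNK_ELEMS = 1 << 16


def chunk_views(grad, chunk_elems=CHUNK_ELEMS):
    """Yield contiguous views over a flattened gradient buffer (zero-copy)."""
    flat = grad.reshape(-1)
    for start in range(0, flat.size, chunk_elems):
        yield flat[start:start + chunk_elems]


def aggregate(chunk, peer_chunk):
    """Vectorized in-place reduction of one chunk (stand-in for SIMD aggregation)."""
    np.add(chunk, peer_chunk, out=chunk)
    return chunk


def sender(tx_queue, transport):
    """Drain aggregated chunks and push them onto the (mock) transport."""
    while True:
        item = tx_queue.get()
        if item is None:  # sentinel: no more chunks
            break
        transport(item)


def aggregate_and_send(grad, peer_grad, transport):
    """Overlap aggregation of chunk i+1 with transmission of chunk i."""
    tx_queue = queue.Queue(maxsize=4)  # bounded: models NIC backpressure
    tx = threading.Thread(target=sender, args=(tx_queue, transport))
    tx.start()
    for mine, theirs in zip(chunk_views(grad), chunk_views(peer_grad)):
        # Hand the reduced chunk to the sender, then immediately start
        # reducing the next one while the previous is in flight.
        tx_queue.put(aggregate(mine, theirs))
    tx_queue.put(None)
    tx.join()


if __name__ == "__main__":
    g = np.ones(1 << 20, dtype=np.float32)
    p = np.full(1 << 20, 2.0, dtype=np.float32)
    sent = []
    aggregate_and_send(g, p, transport=lambda c: sent.append(c.size))
    print(f"transmitted {len(sent)} chunks, {sum(sent)} elements")
```

The bounded queue is the load-balancing property in miniature: if transmission falls behind, aggregation stalls instead of buffering without limit, keeping the two stages of the pipeline matched.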