LagTuner: Adaptive Staleness Orchestration for Parameter-Server AI Training

Abstract

This paper introduces a server-driven orchestration layer for parameter-server training that adaptively bounds iteration skew among workers at runtime, using recent push-timestamp telemetry to minimize straggler waiting while preserving convergence in distributed AI systems. The mechanism selects per-iteration, per-worker staleness allowances within a configurable band, turning gradient exchange into a feedback-controlled service that balances throughput and consistency on both homogeneous and heterogeneous GPU clusters. A formal analysis establishes convergence guarantees comparable to bounded-staleness methods via an O(√T) regret bound, aligning systems control with algorithmic stability for large-scale training. A reference implementation in MXNet integrates worker/server procedures with a synchronization controller that simulates near-term iteration timelines, granting extra steps to the current fastest worker only when doing so minimizes projected wait time. Empirically, on CIFAR-10/100 with AlexNet and ResNet variants across multi-GPU, multi-node deployments, the approach accelerates time-to-accuracy versus bulk-synchronous and fixed-staleness baselines, while matching the agility of asynchronous execution without its instability risks. The results position adaptive staleness control as a practical distributed-systems primitive: coordinating parameter exchange through runtime telemetry to sustain high iteration throughput with robust convergence in production AI training pipelines.
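
The controller logic summarized above, simulating near-term iteration timelines and granting extra steps to the fastest worker only when that reduces its projected wait, can be illustrated with a minimal Python sketch. All names here (WorkerState, StalenessController, the telemetry fields, and the single-horizon wait model) are hypothetical simplifications drawn from the abstract's description, not from the paper's MXNet reference implementation.

import time
from dataclasses import dataclass

@dataclass
class WorkerState:
    worker_id: int
    iteration: int          # last iteration pushed to the server
    last_push: float        # wall-clock time of the most recent push
    avg_step_time: float    # seconds per iteration, from recent push telemetry

class StalenessController:
    def __init__(self, max_staleness: int, extra_band: int):
        self.max_staleness = max_staleness  # base staleness bound
        self.extra_band = extra_band        # configurable band of extra steps

    def projected_wait(self, fastest, slowest, extra_steps, now):
        # Near-term timeline simulation: if the fastest worker runs
        # `extra_steps` more iterations before blocking, how long does it
        # then idle until the slowest worker's next push lifts the bound?
        slowest_next_push = slowest.last_push + slowest.avg_step_time
        time_to_push = max(0.0, slowest_next_push - now)
        busy = extra_steps * fastest.avg_step_time
        return max(0.0, time_to_push - busy)

    def grant_extra_steps(self, workers, now=None):
        # Grant extra steps to the current fastest worker only when doing so
        # reduces its projected wait; ties prefer fewer extra steps, so the
        # allowance never drifts further ahead than the idle window requires.
        now = time.time() if now is None else now
        fastest = min(workers, key=lambda w: w.avg_step_time)
        slowest = max(workers, key=lambda w: w.avg_step_time)
        if fastest.iteration - slowest.iteration < self.max_staleness:
            return 0                        # inside the band, nothing to decide
        return min(range(self.extra_band + 1),
                   key=lambda s: (self.projected_wait(fastest, slowest, s, now), s))

# Example: the fast worker sits at the staleness bound while the slow worker's
# next push is ~0.7 s away; the controller grants two extra steps, just enough
# to cover the idle window.
if __name__ == "__main__":
    now = 100.0
    workers = [WorkerState(0, iteration=12, last_push=99.8, avg_step_time=0.5),
               WorkerState(1, iteration=8,  last_push=99.3, avg_step_time=1.4)]
    ctrl = StalenessController(max_staleness=4, extra_band=3)
    print(ctrl.grant_extra_steps(workers, now))   # prints 2 under this model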
