StrataServe: Hierarchical HBM–DRAM–SSD Parameter Serving for Distributed AI
Abstract
This paper presents a distributed AI training system that pools GPU high-bandwidth memory (HBM), host DRAM, and SSD into a coordinated parameter-serving hierarchy, supporting multi-terabyte, sparsity-dominated deep models without sharing raw features across machines. The design shards and caches only the working set of parameters in GPU memory via multi-GPU hash tables, communicates within each node over NVLink, and synchronizes across nodes with RDMA-backed collective updates to preserve convergence under data parallelism. A four-stage pipeline overlaps network transfers, SSD I/O, CPU partitioning, and GPU compute, while file-level compaction mitigates I/O amplification, sustaining high throughput without inflating latency at scale. On industrial click-through-rate workloads with multi-terabyte embedding tables, the system outperforms a large in-memory CPU cluster while maintaining production-grade accuracy, improving both training speed and price-performance. Overall, the architecture offers a pragmatic blueprint for scaling distributed learning through memory-hierarchy co-design and communication-aware parameter serving rather than brute-force cluster expansion.
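To make the memory-hierarchy idea concrete, the sketch below illustrates how an HBM-resident hash-table cache backed by DRAM and SSD tiers could serve sparse embedding lookups. It is a minimal, simplified illustration of the concept described in the abstract, not the authors' implementation; all class and method names (HierarchicalParameterServer, LruCache, lookup) are hypothetical, and real systems would batch SSD reads, shard across GPUs, and synchronize updates over NVLink and RDMA.

```python
"""Minimal sketch of hierarchical HBM -> DRAM -> SSD parameter serving.

Hypothetical illustration only: embedding rows nominally live on SSD, with
DRAM and GPU HBM modeled as successively smaller LRU caches keyed by feature ID.
"""
from collections import OrderedDict
import numpy as np


class LruCache:
    """Fixed-capacity LRU cache mapping feature IDs to embedding vectors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        vec = self.store.get(key)
        if vec is not None:
            self.store.move_to_end(key)  # mark as recently used
        return vec

    def put(self, key, vec):
        self.store[key] = vec
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used entry


class HierarchicalParameterServer:
    """HBM -> DRAM -> SSD lookup path for sparse embedding parameters."""

    def __init__(self, dim, hbm_slots, dram_slots, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.hbm = LruCache(hbm_slots)    # stand-in for the per-GPU hash table in HBM
        self.dram = LruCache(dram_slots)  # stand-in for the host-memory cache
        self.ssd = {}                     # stand-in for the SSD-resident full table

    def _load_from_ssd(self, key):
        # Lazily initialize missing rows, as a sparse embedding table would.
        if key not in self.ssd:
            self.ssd[key] = self.rng.standard_normal(self.dim).astype(np.float32)
        return self.ssd[key]

    def lookup(self, key):
        vec = self.hbm.get(key)
        if vec is not None:
            return vec                      # HBM hit: fastest path
        vec = self.dram.get(key)
        if vec is None:
            vec = self._load_from_ssd(key)  # SSD path (slow; batched in practice)
            self.dram.put(key, vec)
        self.hbm.put(key, vec)              # promote the working set into HBM
        return vec


if __name__ == "__main__":
    ps = HierarchicalParameterServer(dim=8, hbm_slots=4, dram_slots=16)
    batch = [3, 7, 3, 42, 7, 1001]          # skewed feature IDs, typical of CTR data
    vectors = np.stack([ps.lookup(fid) for fid in batch])
    print(vectors.shape)                     # (6, 8)
```

The design choice this mirrors is that only the hot working set of a skewed CTR feature distribution needs to occupy scarce HBM, while the long tail stays in cheaper DRAM and SSD tiers; the four-stage pipeline in the paper then hides the latency of the slower tiers behind network transfer and GPU compute.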