StrataServe: Hierarchical HBM–DRAM–SSD Parameter Serving for Distributed AI
Abstract
This paper presents a distributed AI training system that pools GPU high-bandwidth memory (HBM), host DRAM, and SSD into a coordinated parameter-serving hierarchy, supporting multi-terabyte, sparsity-dominated deep models without sharing raw features across machines. The design shards and caches only the working set of parameters in GPU memory via multi-GPU hash tables, communicates within each node over NVLink, and synchronizes across nodes with RDMA-backed collective updates to preserve convergence under data parallelism. A four-stage pipeline overlaps network transfers, SSD I/O, CPU partitioning, and GPU compute, while file-level compaction mitigates I/O amplification, sustaining high throughput without inflating latency at scale. On industrial click-through-rate workloads with multi-terabyte embedding tables, the system outperforms a large in-memory CPU cluster while maintaining production-grade accuracy, improving both training speed and price-performance. Overall, the architecture offers a pragmatic blueprint for scaling distributed learning through memory-hierarchy co-design and communication-aware parameter serving rather than brute-force cluster expansion.
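To make the memory-hierarchy idea concrete, the sketch below illustrates how an HBM-resident hash-table cache backed by DRAM and SSD tiers could serve sparse embedding lookups. It is a minimal, simplified illustration of the concept described in the abstract, not the authors' implementation; all class and method names (HierarchicalParameterServer, LruCache, lookup) are hypothetical, and real systems would batch SSD reads, shard across GPUs, and synchronize updates over NVLink and RDMA.

```python
"""Minimal sketch of hierarchical HBM -> DRAM -> SSD parameter serving.

Hypothetical illustration only: embedding rows nominally live on SSD, with
DRAM and GPU HBM modeled as successively smaller LRU caches keyed by feature ID.
"""
from collections import OrderedDict
import numpy as np


class LruCache:
    """Fixed-capacity LRU cache mapping feature IDs to embedding vectors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        vec = self.store.get(key)
        if vec is not None:
            self.store.move_to_end(key)  # mark as recently used
        return vec

    def put(self, key, vec):
        self.store[key] = vec
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used entry


class HierarchicalParameterServer:
    """HBM -> DRAM -> SSD lookup path for sparse embedding parameters."""

    def __init__(self, dim, hbm_slots, dram_slots, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.hbm = LruCache(hbm_slots)    # stand-in for the per-GPU hash table in HBM
        self.dram = LruCache(dram_slots)  # stand-in for the host-memory cache
        self.ssd = {}                     # stand-in for the SSD-resident full table

    def _load_from_ssd(self, key):
        # Lazily initialize missing rows, as a sparse embedding table would.
        if key not in self.ssd:
            self.ssd[key] = self.rng.standard_normal(self.dim).astype(np.float32)
        return self.ssd[key]

    def lookup(self, key):
        vec = self.hbm.get(key)
        if vec is not None:
            return vec                      # HBM hit: fastest path
        vec = self.dram.get(key)
        if vec is None:
            vec = self._load_from_ssd(key)  # SSD path (slow; batched in practice)
            self.dram.put(key, vec)
        self.hbm.put(key, vec)              # promote the working set into HBM
        return vec


if __name__ == "__main__":
    ps = HierarchicalParameterServer(dim=8, hbm_slots=4, dram_slots=16)
    batch = [3, 7, 3, 42, 7, 1001]          # skewed feature IDs, typical of CTR data
    vectors = np.stack([ps.lookup(fid) for fid in batch])
    print(vectors.shape)                     # (6, 8)
```

The design choice this mirrors is that only the hot working set of a skewed CTR feature distribution needs to occupy scarce HBM, while the long tail stays in cheaper DRAM and SSD tiers; the four-stage pipeline in the paper then hides the latency of the slower tiers behind network transfer and GPU compute.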