StrataServe: Hierarchical HBM–DRAM–SSD Parameter Serving for Distributed AI

Abstract

This paper presents a distributed AI training system that pools GPU high-bandwidth memory (HBM), host DRAM, and SSD into a coordinated parameter-serving hierarchy, supporting multi-terabyte, sparsity-dominated deep models without sharing raw features across machines. The design shards and caches only the working set of parameters in GPU memory via multi-GPU hash tables, communicates intra-node over NVLink, and synchronizes inter-node updates with RDMA-backed collectives to preserve convergence under data parallelism. A four-stage pipeline overlaps network transfers, SSD I/O, CPU partitioning, and GPU compute, while file-level compaction mitigates I/O amplification, sustaining high throughput without inflating latency at scale. On industrial click-through-rate workloads with multi-terabyte embeddings, the system outperforms a large in-memory CPU cluster while maintaining production-grade accuracy, improving both training speed and price-performance. Overall, the architecture offers a pragmatic blueprint for scaling distributed learning through memory-hierarchy co-design and communication-aware parameter serving rather than brute-force cluster expansion.
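
As a rough illustration of the tiered lookup described in the abstract, the sketch below models an HBM-resident cache backed by DRAM and SSD tiers with simple spill-down eviction. It is a minimal Python sketch under stated assumptions, not the authors' implementation: all names (TieredParameterStore, lookup, the slot counts) are hypothetical, and the real system uses multi-GPU hash tables, NVLink/RDMA transport, and a pipelined, compaction-aware SSD layer rather than host-side dictionaries.

```python
"""Illustrative sketch (not the paper's code) of a three-tier parameter store:
a small HBM-resident cache, a larger DRAM tier, and an SSD-backed store of
record. Capacities and data structures are placeholders."""

from collections import OrderedDict
import numpy as np


class TieredParameterStore:
    def __init__(self, dim: int, hbm_slots: int, dram_slots: int):
        self.dim = dim
        # Tier 0: GPU HBM cache (modeled here as an LRU hash table on the host).
        self.hbm = OrderedDict()
        self.hbm_slots = hbm_slots
        # Tier 1: host DRAM cache, also LRU, larger than the HBM tier.
        self.dram = OrderedDict()
        self.dram_slots = dram_slots
        # Tier 2: SSD-backed store of record (modeled as a plain dict; the paper
        # describes file-level compaction to curb I/O amplification).
        self.ssd = {}

    def _init_embedding(self, key: int) -> np.ndarray:
        # Lazily materialize a parameter row the first time a sparse key is seen.
        rng = np.random.default_rng(key)
        return rng.standard_normal(self.dim).astype(np.float32)

    def lookup(self, key: int) -> np.ndarray:
        # Fast path: the working parameters are already cached in HBM.
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key]
        # Miss in HBM: try DRAM, then SSD, initializing on a cold miss.
        if key in self.dram:
            row = self.dram.pop(key)
        elif key in self.ssd:
            row = self.ssd[key]
        else:
            row = self._init_embedding(key)
            self.ssd[key] = row
        self._promote(key, row)
        return row

    def _promote(self, key: int, row: np.ndarray) -> None:
        # Install the row in HBM, spilling least-recently-used entries down
        # the hierarchy so each tier's capacity bound is respected.
        self.hbm[key] = row
        if len(self.hbm) > self.hbm_slots:
            old_key, old_row = self.hbm.popitem(last=False)
            self.dram[old_key] = old_row
            if len(self.dram) > self.dram_slots:
                spilled_key, spilled_row = self.dram.popitem(last=False)
                self.ssd[spilled_key] = spilled_row


if __name__ == "__main__":
    store = TieredParameterStore(dim=8, hbm_slots=4, dram_slots=16)
    for feature_id in [3, 7, 3, 42, 7, 99, 3]:
        print(feature_id, store.lookup(feature_id)[:3])
```

In the described system, the analogue of this cache is partitioned across GPUs, misses trigger batched DRAM/SSD reads that are overlapped with network transfers, CPU partitioning, and GPU compute by the four-stage pipeline, and inter-node parameter updates travel over RDMA-backed collectives.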
