I/O for LLM Inference: A Survey of Storage and Memory Bottlenecks
Abstract
Deploying Large Language Models at scale has shifted the dominant bottleneck from compute during training to memory and I/O during inference. As parameter counts reach hundreds of billions and context windows stretch past a million tokens, latency and throughput are limited not by arithmetic but by data movement across the memory hierarchy. This survey decomposes inference I/O into three flows (model weight I/O, Key-Value (KV) cache I/O, and activation I/O) and uses roofline analysis to map each optimization to the memory-hierarchy level it targets. We cover quantization, PagedAttention, FlashAttention, speculative decoding, KV cache compression, and offloading, alongside system-level orchestration (continuous batching, disaggregated prefill-decode, prefix caching) and hardware trends (HBM scaling, CXL, processing-in-memory, unified memory). A composability analysis reveals that stacking optimizations causes the dominant bottleneck to oscillate between weight and KV cache I/O. We close by identifying open problems in unbounded-context scaling, expert caching, edge deployment, and I/O-aware benchmarking.
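The oscillation between weight I/O and KV cache I/O can be made concrete with a back-of-envelope calculation of bytes moved per decoded token. The sketch below is illustrative only: the model shape (a hypothetical 70B-parameter model with grouped-query attention), the 16-bit precision, and the context length are assumptions chosen for the example, not figures from the survey.

```python
# Back-of-envelope decode-step I/O for a roofline-style comparison.
# All model and precision figures below are illustrative assumptions.

def decode_bytes_moved(n_params, bytes_per_weight, n_layers, n_kv_heads,
                       head_dim, context_len, bytes_per_kv):
    """Approximate bytes read from HBM for one single-token decode step
    at batch size 1: every weight is streamed once, and the full KV cache
    (keys and values, hence the factor of 2) is read by attention."""
    weight_io = n_params * bytes_per_weight
    kv_io = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_kv
    return weight_io, kv_io

# Hypothetical 70B-class model, 16-bit weights and KV, 128k-token context.
w, kv = decode_bytes_moved(n_params=70e9, bytes_per_weight=2,
                           n_layers=80, n_kv_heads=8, head_dim=128,
                           context_len=128_000, bytes_per_kv=2)
print(f"weight I/O: {w / 1e9:.0f} GB/token, KV cache I/O: {kv / 1e9:.0f} GB/token")
# Weight I/O dominates here; since KV I/O grows linearly with context length
# (and batch size), it overtakes weight I/O at long enough contexts.
```

Under these assumptions, weight I/O still dominates at 128k tokens, but because KV cache I/O scales linearly with context length and batch size while weight I/O is fixed, the crossover arrives as either grows, which is the oscillation the composability analysis describes.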