I/O for LLM Inference: A Survey of Storage and Memory Bottlenecks

Abstract

Deploying Large Language Models at scale has shifted the dominant bottleneck from compute during training to memory and I/O during inference. As parameter counts reach hundreds of billions and context windows stretch past a million tokens, latency and throughput are limited not by arithmetic but by data movement across the memory hierarchy. This survey decomposes inference I/O into three flows (model weight I/O, Key-Value (KV) cache I/O, and activation I/O) and uses roofline analysis to map each optimization to the memory-hierarchy level it targets. We cover quantization, PagedAttention, FlashAttention, speculative decoding, KV cache compression, and offloading, alongside system-level orchestration (continuous batching, disaggregated prefill-decode, prefix caching) and hardware trends (HBM scaling, CXL, processing-in-memory, unified memory). A composability analysis reveals that stacking optimizations causes the dominant bottleneck to oscillate between weight I/O and KV cache I/O. We close by identifying open problems in unbounded-context scaling, expert caching, edge deployment, and I/O-aware benchmarking.
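The roofline framing in the abstract can be illustrated with a back-of-envelope calculation. The sketch below uses assumed, illustrative numbers (a 70B-parameter FP16 model on an A100/H100-class accelerator); none of these figures come from the survey itself. It shows why single-stream decode sits far below the hardware ridge point and is therefore bound by weight I/O, not arithmetic.

```python
# Illustrative roofline check for batch-1 LLM decode.
# All hardware and model numbers below are assumptions for illustration.

PEAK_FLOPS = 2.0e15           # peak FP16 throughput, FLOP/s (assumed)
HBM_BW = 3.35e12              # HBM bandwidth, bytes/s (assumed)
ridge = PEAK_FLOPS / HBM_BW   # arithmetic intensity (FLOP/byte) at which
                              # the kernel transitions to compute-bound

params = 70e9                 # model size (assumed: 70B parameters)
bytes_per_param = 2           # FP16 storage
weight_bytes = params * bytes_per_param  # weight I/O per decoded token
flops_per_token = 2 * params  # ~2 FLOPs (multiply + add) per parameter

intensity = flops_per_token / weight_bytes  # FLOPs per byte of weight traffic
print(f"ridge point:      {ridge:.0f} FLOP/byte")
print(f"decode intensity: {intensity:.1f} FLOP/byte")
print("memory-bound" if intensity < ridge else "compute-bound")
```

With these assumptions the decode intensity is about 1 FLOP/byte against a ridge point of roughly 600 FLOP/byte, so every optimization the survey covers (quantization, batching, KV cache compression, offloading) can be read as an attempt to raise effective intensity or reduce the bytes moved per token.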
