I/O for LLM Inference: A Survey of Storage and Memory Bottlenecks

Abstract

Deploying Large Language Models at scale has shifted the dominant bottleneck from compute during training to memory and I/O during inference. As parameter counts reach hundreds of billions and context windows stretch past a million tokens, latency and throughput are limited not by arithmetic but by data movement across the memory hierarchy. This survey decomposes inference I/O into three flows (model weight I/O, Key-Value (KV) cache I/O, and activation I/O) and uses roofline analysis to map each optimization to the memory-hierarchy level it targets. We cover quantization, PagedAttention, FlashAttention, speculative decoding, KV cache compression, and offloading, alongside system-level orchestration (continuous batching, disaggregated prefill-decode, prefix caching) and hardware trends (HBM scaling, CXL, processing-in-memory, unified memory). A composability analysis reveals that stacking optimizations causes the dominant bottleneck to oscillate between weight I/O and KV cache I/O. We close by identifying open problems in unbounded-context scaling, expert caching, edge deployment, and I/O-aware benchmarking.
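The roofline framing in the abstract can be illustrated with a back-of-envelope calculation. The sketch below uses assumed, illustrative numbers (a 70B-parameter FP16 model on an A100/H100-class accelerator); none of these figures come from the survey itself. It shows why single-stream decode sits far below the hardware ridge point and is therefore bound by weight I/O, not arithmetic.

```python
# Illustrative roofline check for batch-1 LLM decode.
# All hardware and model numbers below are assumptions for illustration.

PEAK_FLOPS = 2.0e15           # peak FP16 throughput, FLOP/s (assumed)
HBM_BW = 3.35e12              # HBM bandwidth, bytes/s (assumed)
ridge = PEAK_FLOPS / HBM_BW   # arithmetic intensity (FLOP/byte) at which
                              # the kernel transitions to compute-bound

params = 70e9                 # model size (assumed: 70B parameters)
bytes_per_param = 2           # FP16 storage
weight_bytes = params * bytes_per_param  # weight I/O per decoded token
flops_per_token = 2 * params  # ~2 FLOPs (multiply + add) per parameter

intensity = flops_per_token / weight_bytes  # FLOPs per byte of weight traffic
print(f"ridge point:      {ridge:.0f} FLOP/byte")
print(f"decode intensity: {intensity:.1f} FLOP/byte")
print("memory-bound" if intensity < ridge else "compute-bound")
```

With these assumptions the decode intensity is about 1 FLOP/byte against a ridge point of roughly 600 FLOP/byte, so every optimization the survey covers (quantization, batching, KV cache compression, offloading) can be read as an attempt to raise effective intensity or reduce the bytes moved per token.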
