Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache

Hanchen Li
Yuhan Liu
Yihua Cheng
Kuntai Du
Junchen Jiang

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Across large language model (LLM) applications, we observe an emerging trend for reusing KV caches to save the prefill delays of processing repeated input texts in different LLM inputs. This has led to a broad design space, including colocating stored KV caches with (or close to) GPUs to various KV cache compression. However, a key question remains unanswered: can these delay reductions also be economically favorable? Specifically, we ask whether a developer can use public cloud services to store precomputed KV caches and reuse them to save delay without incurring more costs in terms of compute, storage, and network. To answer this question, we propose an validated analytical model for the cloud cost (in compute, storage, and network) of storing and reusing KV caches based on various workload parameters, such as reuse frequency, generated text lengths, model sizes, etc. Preliminary results show that KV cache reusing is able to save both delay and cloud cost across a range of workloads with long context. And we call more efforts on building more economical context augmented LLM by KV cache reusing.

Version published to 10.32388/xi1064
Apr 9, 2025

FlashServe: Cost-Efficient Serverless Inference Scheduling for Large Language Models via Tiered Memory Management and Predictive Autoscaling

This article has 1 author:
1. Bolin Chen
This article has no evaluationsLatest version Dec 22, 2025
Implementation and Evaluation of MemGuard in the Bao Hypervisor

This article has 2 authors:
1. Everaldo Gomes
2. Giovani Gracioli
This article has no evaluationsLatest version Jan 19, 2026
Parallel Architectures for Large - Scale Document Processing:Integrating OCR and RAG Pipelines

This article has 4 authors:
1. Alejandro Jaime
2. Veronica Gil-Costa
3. Marcelo Errecalde
4. Leticia Cagnina
This article has no evaluationsLatest version Jan 19, 2026

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

FlashServe: Cost-Efficient Serverless Inference Scheduling for Large Language Models via Tiered Memory Management and Predictive Autoscaling

Implementation and Evaluation of MemGuard in the Bao Hypervisor

Parallel Architectures for Large - Scale Document Processing:Integrating OCR and RAG Pipelines