FlashServe: Cost-Efficient Serverless Inference Scheduling for Large Language Models via Tiered Memory Management and Predictive Autoscaling

Abstract

Deploying Large Language Models (LLMs) in cloud environments presents significant challenges due to their substantial memory footprint and computational requirements. While serverless architectures offer attractive pay-per-use economics, they suffer from prohibitively long cold start times when loading multi-gigabyte model weights into GPU memory. This paper presents FlashServe, a serverless LLM inference system that achieves fast cold starts through three key innovations: (1) a tiered memory snapshotting mechanism that pre-stages model checkpoints in host DRAM and leverages high-speed DMA transfers via PCIe for rapid GPU memory loading, (2) a hybrid Prophet-LSTM prediction model for proactive pod pre-warming based on request arrival patterns, and (3) efficient LoRA adapter multiplexing that enables serving multiple fine-tuned models on shared GPU resources. Extensive experiments on the Azure Functions trace dataset demonstrate that FlashServe reduces cold start latency by up to 49× compared to baseline S3-based loading approaches and by 3.3× compared to state-of-the-art systems such as ServerlessLLM. Under realistic bursty workloads, FlashServe achieves a 32% reduction in GPU idle costs while maintaining sub-second time-to-first-token (TTFT) latency for 95% of requests. These results indicate that FlashServe is a meaningful step toward practical serverless LLM deployment.
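
The tiered snapshotting idea in innovation (1) can be illustrated with a short PyTorch sketch. This is a minimal illustration under assumed details, not FlashServe's actual implementation: the function names (`pre_stage_checkpoint`, `fast_gpu_load`) and the use of PyTorch pinned host memory are assumptions made here for clarity. Pinning (page-locking) host DRAM is what allows the GPU's copy engine to perform DMA transfers over PCIe without an extra staging copy, which is the mechanism the abstract describes.

```python
# Illustrative sketch only: pre-stage weights in pinned host DRAM once,
# then copy them to GPU memory with asynchronous DMA transfers on cold start.
import torch


def pre_stage_checkpoint(state_dict):
    """Deploy-time step (assumed): pin each weight tensor in host DRAM."""
    return {name: t.contiguous().pin_memory() for name, t in state_dict.items()}


def fast_gpu_load(pinned_state_dict, device="cuda:0"):
    """Cold-start path (assumed): async host-to-GPU copies over PCIe."""
    stream = torch.cuda.Stream(device=device)
    gpu_state = {}
    with torch.cuda.stream(stream):
        for name, t in pinned_state_dict.items():
            # non_blocking=True lets the DMA copy engine overlap transfers
            gpu_state[name] = t.to(device, non_blocking=True)
    stream.synchronize()  # ensure all weights are resident before serving
    return gpu_state
```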
