DeepServe: SLO-Aware and Cost-Aware Elastic Scheduling for Serverless Multi-Tenant LLM Inference

Abstract

Deploying large language model (LLM) inference services in serverless, multi-tenant environments presents compounding challenges: cold-start latency, GPU memory fragmentation, inter-tenant resource contention, and unpredictable tail latency. Existing systems optimize individual aspects but fail to jointly address service-level objective (SLO) compliance and cost efficiency under dynamic, heterogeneous workloads. We present DeepServe++, an elastic scheduling framework that formulates joint SLO--cost optimization as a contextual bandit problem. The system introduces a Request Profiler that extracts online features---prompt length, historical KV-cache hit ratio, and predicted generation length---and feeds them into a contextual bandit agent that adaptively selects batch sizes, concurrency levels, KV-cache eviction policies, and warm-standby strategies. We evaluate DeepServe++ on ShareGPT and BurstGPT traces using LLaMA-2-13B and Mixtral-8x7B models on NVIDIA A100 GPUs. Results show that DeepServe++ reduces P99 latency by 38--62% compared to state-of-the-art baselines while improving GPU utilization by 14--23% and reducing per-request cost by up to 27%, with only a modest increase in scheduling overhead.
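To make the contextual-bandit formulation concrete, the following is a minimal sketch of how a scheduler might map the profiled features (prompt length, KV-cache hit ratio, predicted generation length) to a discrete scheduling action. The action space, feature encoding, epsilon-greedy linear agent, and reward shape here are all illustrative assumptions; the paper's actual agent and configuration knobs may differ.

```python
import random

# Hypothetical action space: each arm bundles the knobs the abstract lists
# (batch size, concurrency, KV-cache eviction policy, warm-standby strategy).
ACTIONS = [
    {"batch_size": 8,  "concurrency": 2, "eviction": "lru", "warm_standby": False},
    {"batch_size": 16, "concurrency": 4, "eviction": "lru", "warm_standby": True},
    {"batch_size": 32, "concurrency": 8, "eviction": "lfu", "warm_standby": True},
]

class EpsilonGreedyLinearBandit:
    """Contextual bandit with one linear reward model per action.

    A deliberately simple stand-in for whatever agent DeepServe++ uses:
    each arm's expected reward is a linear function of the request
    features, updated online by stochastic gradient descent.
    """

    def __init__(self, n_actions, n_features, epsilon=0.1, lr=0.05):
        self.epsilon = epsilon      # exploration probability
        self.lr = lr                # SGD step size
        # One weight vector (bias + per-feature weights) per action.
        self.w = [[0.0] * (n_features + 1) for _ in range(n_actions)]

    def _predict(self, a, x):
        w = self.w[a]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

    def select(self, x):
        """Pick an action for feature vector x (epsilon-greedy)."""
        if random.random() < self.epsilon:
            return random.randrange(len(self.w))
        scores = [self._predict(a, x) for a in range(len(self.w))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, a, x, reward):
        """SGD step toward the observed reward for the chosen action."""
        err = reward - self._predict(a, x)
        self.w[a][0] += self.lr * err
        for i, xi in enumerate(x):
            self.w[a][i + 1] += self.lr * err * xi
```

A plausible reward signal would combine SLO compliance and cost, e.g. `reward = slo_met - cost_weight * per_request_cost`, so the agent learns to trade tail latency against GPU spend; the exact weighting is not specified in the abstract.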
