DeepServe: SLO-Aware and Cost-Aware Elastic Scheduling for Serverless Multi-Tenant LLM Inference

Abstract

Deploying large language model (LLM) inference services in serverless, multi-tenant environments presents compounding challenges: cold-start latency, GPU memory fragmentation, inter-tenant resource contention, and unpredictable tail latency. Existing systems optimize individual aspects but fail to jointly address service-level objective (SLO) compliance and cost efficiency under dynamic, heterogeneous workloads. We present DeepServe++, an elastic scheduling framework that formulates joint SLO--cost optimization as a contextual bandit problem. The system introduces a Request Profiler that extracts online features---prompt length, historical KV-cache hit ratio, and predicted generation length---and feeds them into a contextual bandit agent that adaptively selects batch sizes, concurrency levels, KV-cache eviction policies, and warm-standby strategies. We evaluate DeepServe++ on ShareGPT and BurstGPT traces using LLaMA-2-13B and Mixtral-8x7B models on NVIDIA A100 GPUs. Results show that DeepServe++ reduces P99 latency by 38--62% compared to state-of-the-art baselines while improving GPU utilization by 14--23% and reducing per-request cost by up to 27%, with only a modest increase in scheduling overhead.
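To make the contextual-bandit formulation concrete, the following is a minimal sketch of how a scheduler might map the profiled features (prompt length, KV-cache hit ratio, predicted generation length) to a discrete scheduling action. The action space, feature encoding, epsilon-greedy linear agent, and reward shape here are all illustrative assumptions; the paper's actual agent and configuration knobs may differ.

```python
import random

# Hypothetical action space: each arm bundles the knobs the abstract lists
# (batch size, concurrency, KV-cache eviction policy, warm-standby strategy).
ACTIONS = [
    {"batch_size": 8,  "concurrency": 2, "eviction": "lru", "warm_standby": False},
    {"batch_size": 16, "concurrency": 4, "eviction": "lru", "warm_standby": True},
    {"batch_size": 32, "concurrency": 8, "eviction": "lfu", "warm_standby": True},
]

class EpsilonGreedyLinearBandit:
    """Contextual bandit with one linear reward model per action.

    A deliberately simple stand-in for whatever agent DeepServe++ uses:
    each arm's expected reward is a linear function of the request
    features, updated online by stochastic gradient descent.
    """

    def __init__(self, n_actions, n_features, epsilon=0.1, lr=0.05):
        self.epsilon = epsilon      # exploration probability
        self.lr = lr                # SGD step size
        # One weight vector (bias + per-feature weights) per action.
        self.w = [[0.0] * (n_features + 1) for _ in range(n_actions)]

    def _predict(self, a, x):
        w = self.w[a]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

    def select(self, x):
        """Pick an action for feature vector x (epsilon-greedy)."""
        if random.random() < self.epsilon:
            return random.randrange(len(self.w))
        scores = [self._predict(a, x) for a in range(len(self.w))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, a, x, reward):
        """SGD step toward the observed reward for the chosen action."""
        err = reward - self._predict(a, x)
        self.w[a][0] += self.lr * err
        for i, xi in enumerate(x):
            self.w[a][i + 1] += self.lr * err * xi
```

A plausible reward signal would combine SLO compliance and cost, e.g. `reward = slo_met - cost_weight * per_request_cost`, so the agent learns to trade tail latency against GPU spend; the exact weighting is not specified in the abstract.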
