Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs
Abstract
The serverless computing paradigm offers compelling advantages for deploying Large Language Model (LLM) inference services, including elastic scaling and pay-per-use billing. However, serving multiple fine-tuned LLMs via Low-Rank Adaptation (LoRA) in serverless environments faces critical challenges: reactive adapter loading causes significant cold-start latency, and frequent adapter swapping leads to severe GPU memory fragmentation. In this paper, we present Predictive-LoRA (P-LoRA), a proactive and fragmentation-aware serverless inference system for LoRA-based LLMs. P-LoRA introduces two key innovations: (1) a lightweight LSTM-based traffic predictor that forecasts adapter demand and proactively prefetches hot adapters from host memory to GPU, reducing cold-start latency by up to 68%; and (2) a page-based adapter memory management mechanism, inspired by operating-system virtual memory, that keeps GPU memory utilization above 87% even under heterogeneous adapter ranks. We evaluate P-LoRA using production-like workloads derived from the Azure Functions trace. Experimental results demonstrate that P-LoRA achieves 1.52× higher throughput than S-LoRA while reducing average Time-To-First-Token (TTFT) by 35% under high concurrency.
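To make the second idea concrete, the sketch below illustrates one way a page-based adapter pool of the kind the abstract describes could be organized: GPU memory is carved into fixed-size pages drawn from a shared free list, so adapters of different ranks can be loaded and evicted without leaving unusable holes. All names here (PagedAdapterPool, page_size, alloc/free) are illustrative assumptions for exposition, not the paper's actual API.

```python
# Minimal sketch of a page-based adapter pool in the spirit of P-LoRA's
# fragmentation-aware memory manager. Names and parameters are assumed for
# illustration; the paper's real implementation may differ.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PagedAdapterPool:
    """Maps LoRA adapters of heterogeneous ranks onto fixed-size pages.

    Because every allocation is a whole number of equally sized pages taken
    from one shared free list, repeated loading and eviction never produces
    external fragmentation; waste is bounded by at most one partially filled
    page per adapter.
    """
    total_pages: int                               # pages carved from the GPU pool
    page_size: int = 4096                          # adapter parameters per page (assumed)
    free_pages: List[int] = field(default_factory=list)
    page_table: Dict[str, List[int]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free_pages = list(range(self.total_pages))

    def alloc(self, adapter_id: str, num_params: int) -> List[int]:
        """Reserve enough pages for an adapter; pages need not be contiguous."""
        needed = -(-num_params // self.page_size)  # ceiling division
        if needed > len(self.free_pages):
            raise MemoryError(f"not enough free pages for {adapter_id}")
        pages = [self.free_pages.pop() for _ in range(needed)]
        self.page_table[adapter_id] = pages
        return pages

    def free(self, adapter_id: str) -> None:
        """Return an evicted adapter's pages to the shared free list."""
        self.free_pages.extend(self.page_table.pop(adapter_id))


# Example: adapters of different ranks share one pool without fragmenting it.
pool = PagedAdapterPool(total_pages=1024)
pool.alloc("adapter-r8", num_params=8 * 4096 * 2)    # small, rank-8 adapter
pool.alloc("adapter-r64", num_params=64 * 4096 * 2)  # larger, rank-64 adapter
pool.free("adapter-r8")                              # freed pages are reusable as-is
```

In this scheme, a prefetch decision from the traffic predictor simply turns into an alloc call ahead of the forecast demand, while evictions return whole pages that any later adapter can reuse regardless of its rank.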