Dynamic Micro-Batch and Token-Budget Scheduling for IoT-Scale Pipeline-Parallel LLM Inference
Abstract
Large language models (LLMs) are increasingly integrated into IoT–edge–cloud systems, where real-time analytics and natural-language interaction demand both high throughput and stable latency. However, IoT workloads are inherently bursty and heterogeneous: prompt and generation lengths vary widely, and prefill-heavy and decode-heavy requests coexist. When served via pipeline-parallel LLM inference, these characteristics amplify micro-batch imbalance and communication stalls, leading to substantial GPU idle time and degraded service-level objectives (SLOs) for time-to-first-token (TTFT) and inter-token latency (ITL). We propose a runtime-adaptive scheduling framework that combines Dynamic Token-Budget Estimation with Dynamic Micro-Batch Scheduling. Unlike static token-budget settings, which act primarily as latency knobs, our approach dynamically adjusts token budgets to balance prefill and decode workloads across micro-batches, while selecting the number of micro-batches that minimizes pipeline bubbles under varying compute and network conditions. We implement the framework on a four-node RTX 4070 cluster running pipeline-parallel Llama-2-13b-chat with vLLM and evaluate it on both synthetic offline workloads and online Poisson-arrival workloads. The combined scheme reduces GPU idle time by up to 55% and improves throughput by up to 1.61× (measured by end-to-end completion time) compared with the baseline, while significantly increasing TTFT and ITL SLO satisfaction under bursty conditions. These results demonstrate that dynamic, workload-aware scheduling is essential for scalable and latency-stable LLM inference in IoT–edge–cloud environments.
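To make the two control decisions concrete, the following minimal Python sketch shows one way a runtime could couple a workload-aware token budget with a micro-batch count and a greedy balancing pass. It is not the authors' implementation: all names (Request, choose_token_budget, choose_num_microbatches, form_microbatches, PIPELINE_STAGES, the budget bounds) are illustrative assumptions, and real serving systems such as vLLM expose different interfaces.

# Hypothetical sketch of dynamic token-budget + micro-batch scheduling.
# Names and constants are assumptions, not the paper's actual code.
from dataclasses import dataclass
from typing import List, Tuple

PIPELINE_STAGES = 4          # e.g., one pipeline stage per RTX 4070 node
MIN_BUDGET, MAX_BUDGET = 256, 4096   # per-micro-batch token budget range (assumed)

@dataclass
class Request:
    prompt_tokens: int       # remaining prefill work for this request
    decode_tokens: int       # decode tokens issued this iteration (typically 1)

def choose_token_budget(queue: List[Request]) -> int:
    """Scale the per-micro-batch token budget with the current prefill share,
    so prefill-heavy bursts do not starve decode-heavy requests."""
    prefill = sum(r.prompt_tokens for r in queue)
    decode = sum(r.decode_tokens for r in queue)
    total = max(prefill + decode, 1)
    # More pending prefill work -> larger budget; mostly decode -> smaller budget.
    return MIN_BUDGET + int((MAX_BUDGET - MIN_BUDGET) * prefill / total)

def choose_num_microbatches(queue: List[Request], budget: int) -> int:
    """Pick enough micro-batches to keep every pipeline stage busy, but not so
    many that tiny batches add communication overhead and bubbles."""
    work = sum(r.prompt_tokens + r.decode_tokens for r in queue)
    needed = max(1, (work + budget - 1) // budget)   # ceil(work / budget)
    return max(PIPELINE_STAGES, min(needed, 4 * PIPELINE_STAGES))

def form_microbatches(
    queue: List[Request], budget: int, n_mb: int
) -> Tuple[List[List[Request]], List[Request]]:
    """Greedy balancing: place each request into the currently lightest
    micro-batch, subject to the token budget; defer the rest."""
    batches: List[List[Request]] = [[] for _ in range(n_mb)]
    loads = [0] * n_mb
    deferred: List[Request] = []     # carried over to the next scheduling round
    for r in sorted(queue, key=lambda r: r.prompt_tokens + r.decode_tokens,
                    reverse=True):
        cost = r.prompt_tokens + r.decode_tokens
        i = min(range(n_mb), key=lambda k: loads[k])
        # An oversized request may still occupy an empty micro-batch alone.
        if loads[i] + cost <= budget or not batches[i]:
            batches[i].append(r)
            loads[i] += cost
        else:
            deferred.append(r)
    return batches, deferred

if __name__ == "__main__":
    queue = [Request(900, 1), Request(40, 1), Request(1200, 1), Request(16, 1)]
    budget = choose_token_budget(queue)
    n_mb = choose_num_microbatches(queue, budget)
    batches, deferred = form_microbatches(queue, budget, n_mb)
    print(budget, n_mb, [len(b) for b in batches], len(deferred))

The intent of the sketch is only to show the coupling described in the abstract: the budget tracks the prefill/decode mix of the current queue, the micro-batch count tracks available work relative to that budget (never dropping below the pipeline depth), and the balancing pass keeps per-micro-batch loads even so stages drain at similar rates.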