Trace-Driven HPC Scheduling with Runtime Prediction: A Reproducible Study of Backfilling, Uncertainty, and Fairness Constraints

Vinish Kumar

Abstract

This paper studies whether runtime prediction can materially improve queueing outcomes in HPC scheduling when evaluated on real public workload traces rather than on synthetic workloads or isolated model metrics. We build a reproducible trace-driven pipeline that normalizes public traces into a unified job schema, trains submission-time runtime predictors, and evaluates FCFS, EASY backfilling, and ML-augmented EASY variants under a common simulator. The study uses three public 5,000-job subsets drawn from JSSPP-hosted workload archives: CERIT grid jobs, MetaCentrum 2013, and CERIT soft-wall. Each trace is split temporally into 3,000 training jobs and 2,000 evaluation jobs. We compare a hierarchical median baseline, k-nearest neighbors, and a lightweight gradient-boosting regressor with calibrated prediction intervals. On the two heavier traces, the best scheduler is kNN-assisted EASY, which reduces mean wait time by 57.5% and 49.5% relative to FCFS and by 37.0% and 18.8% relative to EASY with user-requested runtimes. Gradient boosting achieves the best runtime-prediction MAE on two of three traces and the highest interval coverage on all three traces, but it does not consistently yield the best scheduling policy. This gap between prediction accuracy and scheduling utility is the main empirical finding. On the low-utilization CERIT soft-wall trace, all policies collapse to identical schedules, highlighting workload sensitivity. We also report ablations over feature sets, cold-start handling, and fairness constraints. The fairness-constrained scheduler exposes a throughput-fairness tradeoff but does not yet improve the current per-user mean-wait fairness gap metric, indicating that better alignment between control constraints and fairness objectives remains open.
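As an illustration of the evaluation setup described in the abstract, the sketch below shows a temporal train/evaluation split and one possible form of a hierarchical median baseline. The abstract does not say which grouping levels the baseline uses, so the fallback chain here ((user, queue), then user, then global median) is an assumption. The field names `submit`, `user`, `queue`, and `runtime` and the toy job records are likewise hypothetical and are not taken from the paper's unified job schema.

```python
from collections import defaultdict
from statistics import median


def temporal_split(jobs, n_train):
    """Order jobs by submission time; the earliest n_train jobs form the
    training set and the remainder form the evaluation set."""
    ordered = sorted(jobs, key=lambda j: j["submit"])
    return ordered[:n_train], ordered[n_train:]


class HierarchicalMedian:
    """Predict runtime as the training median of the most specific group
    that was seen in training.

    Assumed hierarchy (not confirmed by the paper):
    (user, queue) -> user -> global median.
    """

    def fit(self, jobs):
        by_user_queue = defaultdict(list)
        by_user = defaultdict(list)
        all_runtimes = []
        for j in jobs:
            by_user_queue[(j["user"], j["queue"])].append(j["runtime"])
            by_user[j["user"]].append(j["runtime"])
            all_runtimes.append(j["runtime"])
        self.user_queue = {k: median(v) for k, v in by_user_queue.items()}
        self.user = {k: median(v) for k, v in by_user.items()}
        self.global_median = median(all_runtimes)
        return self

    def predict(self, job):
        key = (job["user"], job["queue"])
        if key in self.user_queue:
            return self.user_queue[key]
        if job["user"] in self.user:
            return self.user[job["user"]]
        # Cold start: the user never appeared in the training window.
        return self.global_median


# Toy records (hypothetical); runtimes are in seconds.
jobs = [
    {"submit": 0, "user": "a", "queue": "short", "runtime": 100},
    {"submit": 10, "user": "a", "queue": "short", "runtime": 300},
    {"submit": 20, "user": "a", "queue": "long", "runtime": 5000},
    {"submit": 30, "user": "b", "queue": "short", "runtime": 50},
    {"submit": 40, "user": "b", "queue": "long", "runtime": 900},
    {"submit": 50, "user": "a", "queue": "short", "runtime": 250},
    {"submit": 60, "user": "c", "queue": "short", "runtime": 70},
    {"submit": 70, "user": "b", "queue": "gpu", "runtime": 400},
]

train, evaluation = temporal_split(jobs, n_train=5)
model = HierarchicalMedian().fit(train)
predictions = [model.predict(j) for j in evaluation]
```

In the study itself the split is 3,000 training jobs and 2,000 evaluation jobs per trace; the toy split of 5 and 3 only keeps the example readable. Splitting by submission time rather than at random means the predictor only ever sees jobs submitted before the ones it is evaluated on, which matches the submission-time prediction setting.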

Version published on Research Square on Apr 14, 2026 (DOI: 10.21203/rs.3.rs-9372280/v1)
