Enhancing HPC Job Run Time Predictions leveraging Machine Learning, Historical Job Data, and Metaheuristic Optimization
Abstract
In High Performance Computing (HPC) systems that rely on job schedulers for resource allocation, accurate prediction of job run times is critical for efficient resource utilization. This study develops data-driven machine learning (ML) models for predicting job run times using three algorithms: Light Gradient Boosting (LGB), Deep Neural Networks (DNN), and Extreme Gradient Boosting (XGB). Unlike approaches that rely solely on job metadata, this work incorporates the historical run times of previously executed similar jobs as features, capturing user-specific behavioral patterns, and analyzes their impact on prediction accuracy. Two metaheuristic algorithms, Genetic Algorithm and Whale Optimization Algorithm, are used to optimize model performance by selecting a relevant feature subset, hyperparameters, and pre-processing techniques. The algorithms employ a sequential selection strategy for historical features to balance prediction accuracy against the computational cost of extracting those features. Baseline versions of the ML models are implemented with Bayesian hyperparameter optimization and embedded feature selection techniques (Ridge Regression and Random Forest). The models are validated on four datasets collected from multiple heterogeneous HPC systems to ensure adaptability to diverse HPC configurations and workloads. The results show that ML models optimized by metaheuristics consistently outperform the baseline models. Optimized XGB models achieved the highest prediction accuracy while using fewer historical features. This work underscores the significance of integrating ML models, historical data, and intelligent optimization techniques in developing accurate, efficient, and generalized HPC job run time prediction models.
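To make the idea of historical run time features concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that "similar jobs" means earlier jobs by the same user, and that only jobs which had already finished when the new job was submitted may be used, so the feature never leaks future information. The record fields (`user`, `submit`, `end`, `run_time`), the feature names (`hist_last`, `hist_mean_k`), and the window size `k` are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean


def add_history_features(jobs, k=3):
    """Return copies of the job records with two illustrative history features.

    hist_last   : run time of the user's most recently finished job
    hist_mean_k : mean run time of the user's last k finished jobs
    Both are None when the user has no finished job at submit time.
    """
    finished = defaultdict(list)  # user -> list of (end_time, run_time)
    enriched_jobs = []
    for job in sorted(jobs, key=lambda j: j["submit"]):
        # Keep only this user's jobs that ended before the current submission.
        done = sorted(
            (end, rt) for end, rt in finished[job["user"]] if end <= job["submit"]
        )
        recent = [rt for _, rt in done[-k:]]
        enriched = dict(job)
        enriched["hist_last"] = recent[-1] if recent else None
        enriched["hist_mean_k"] = mean(recent) if recent else None
        enriched_jobs.append(enriched)
        # Record the current job so later submissions can see it once it ends.
        finished[job["user"]].append((job["end"], job["run_time"]))
    return enriched_jobs


# Hypothetical job log: times are in seconds since an arbitrary origin.
jobs = [
    {"user": "a", "submit": 0, "end": 100, "run_time": 100},
    {"user": "a", "submit": 50, "end": 80, "run_time": 30},
    {"user": "a", "submit": 120, "end": 200, "run_time": 80},
    {"user": "b", "submit": 130, "end": 140, "run_time": 10},
]
features = add_history_features(jobs, k=3)
```

In this toy log, the third job of user `a` sees two finished jobs (run times 30 and 100), while the second job sees none because the first was still running when it was submitted. A larger `k` or additional history features would raise the extraction cost, which is the trade-off the sequential selection strategy in the abstract is meant to balance.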