Queue wait time prediction in High Performance Computing (HPC) Systems

Abstract

High Performance Computing (HPC) systems are critical enablers of scientific research across many domains. Efficient resource allocation, facilitated by job scheduling, is essential for maximizing the utilization of HPC systems. However, the variability in wait times for queued jobs poses challenges for users, which makes accurate job wait time estimation necessary. This paper explores the influence of job characteristics, including job size (the number of nodes requested and walltime), the queue to which the job is submitted, and other resource requirements, on job wait times in leadership-class HPC systems. Focusing on the Theta Cray XC40 and Polaris machines at Argonne National Laboratory, the study evaluates the performance of different supervised learning algorithms in predicting job wait times. It underscores the importance of effective data handling and processing, including outlier detection and Principal Component Analysis (PCA), to enhance prediction accuracy. The findings reveal insights into the relationship between job characteristics and wait times, offering a foundation for optimizing resource allocation and enhancing user experience. The methodologies and tools developed in this study are adaptable to other leadership-class HPC systems, and so contribute to the broader HPC community's efforts to improve job scheduling efficiency and user satisfaction.
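
As a rough illustration of the kind of pipeline the abstract describes (outlier detection, PCA, then a supervised model), the following is a minimal sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the synthetic job data, the feature names (`nodes`, `walltime_h`, `queue_id`), the IQR rule for outliers, the log transform of the wait time, and ordinary least squares standing in for the unspecified supervised learning algorithms. It uses only NumPy.

```python
import numpy as np

def iqr_inlier_mask(values, k=1.5):
    """True where a value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

def pca_fit(X, n_components):
    """Standardize the columns of X and return (mean, std, components)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    _, _, vt = np.linalg.svd((X - mean) / std, full_matrices=False)
    return mean, std, vt[:n_components]

def pca_transform(X, mean, std, components):
    """Project standardized rows of X onto the principal components."""
    return ((X - mean) / std) @ components.T

def fit_linear(Z, y):
    """Least-squares fit of y on Z with an intercept (last coefficient)."""
    A = np.column_stack([Z, np.ones(len(Z))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(Z, coef):
    return np.column_stack([Z, np.ones(len(Z))]) @ coef

# Hypothetical synthetic job log; real scheduler logs would replace this.
rng = np.random.default_rng(0)
n = 500
nodes = rng.integers(1, 513, size=n).astype(float)
walltime_h = rng.uniform(0.5, 24.0, size=n)
queue_id = rng.integers(0, 3, size=n).astype(float)
wait_h = 0.01 * nodes + 0.2 * walltime_h + rng.exponential(1.0, size=n)
wait_h[:5] += 500.0  # inject a few extreme waits to act as outliers

X = np.column_stack([nodes, walltime_h, queue_id])
y = np.log1p(wait_h)  # assumed transform: compress the long right tail

# 1) drop outlying targets, 2) reduce features with PCA, 3) fit a regressor
mask = iqr_inlier_mask(y)
X_clean, y_clean = X[mask], y[mask]
mean, std, comps = pca_fit(X_clean, n_components=2)
Z = pca_transform(X_clean, mean, std, comps)
coef = fit_linear(Z, y_clean)
pred_wait_h = np.expm1(predict_linear(Z, coef))  # back to hours
```

In practice the model would be evaluated on held-out jobs rather than on the rows it was fitted to, and the number of retained components would be chosen from the explained variance.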
