Ephemeral Kubernetes: Dynamically Deleting and Recreating Clusters using Warewulf
Abstract
With the rise of LLMs, GPU acceleration has become essential for both training and serving AI models. This requires HPC systems to be highly flexible in assigning multi-GPU nodes while also maintaining high security standards. Existing approaches share nodes between batch and service schedulers, e.g., Slurm and Kubernetes, by dynamically moving nodes between the schedulers, either through negotiation between the systems or via an external system. However, such a multi-use approach also increases the attack surface, as more scheduling components operate with root permissions. Moreover, it becomes increasingly difficult to recover from a security incident, as attackers might have infected parts of either scheduling system. In this work, we present Ephemeral Kubernetes as a way to dynamically deploy and remove Kubernetes clusters in Warewulf-managed environments, such that nodes can be booted as part of either a Slurm or a Kubernetes cluster and are wiped at shutdown.