On-Demand Systems and Scaled Training Using the JobSet API - Abdullah Gharaibeh, Google & Vanessa Sochat, Lawrence Livermore National Laboratory
Orchestrating complex workflows with heterogeneous components presents challenges that are compounded in ephemeral environments. For example, training large ML models requires efficiently managing a significant number of expensive accelerators, and building on-demand HPC systems can mean composing applications and services. For both, efficient job orchestration is critical to ensure scalability and high resource utilization. This talk introduces the JobSet API (sigs.k8s.io/jobset), which lays the foundation for automating the setup of these designs. We will first demonstrate how JobSet is used to deploy training workloads using common frameworks like PyTorch, and present results from large-scale training experiments on thousands of TPU chips. We then show how to use JobSet to automate the arduous task of setting up HPC systems on demand and creating common environments for experimental comparison.