Building at Scale with H100: Eos as a DGX SuperPOD Reference Model for Large Data Center Builds | Julie Bernauer
As language models grow larger, compute infrastructure must deliver both reliability and performance at unprecedented scale. Beyond having a large number of GPUs working together, the platform needs to guarantee fabric and I/O performance and stability, and its software must be architected for consistency and reliability across workload launching, job scheduling, and monitoring. In this talk, we will describe how Eos was built to leverage an H100 reference cluster architecture.