BUILDING AT SCALE WITH H100: EOS AS A DGX SUPERPOD REFERENCE MODEL FOR LARGE DATA CENTER BUILDS

Julie Bernauer

NVIDIA

TOPIC: Systems and Networking

@SCALE SERIES: Systems and Networking

TYPE: video

YEAR: 2024

As language models grow larger, compute infrastructure must deliver both reliability and performance at unprecedented scale. Beyond having a large number of GPUs working together, the platform must guarantee fabric and I/O performance and stability, and its software must be architected for consistency and reliability across workload launching, job scheduling, and monitoring. In this talk, we describe how Eos was built to leverage an H100 reference cluster architecture.
