June 17, 2024

BUILDING AT SCALE WITH H100: EOS AS A DGX SUPERPOD REFERENCE MODEL FOR LARGE DATA CENTER BUILDS

Topic: Systems and Networking

Julie Bernauer

NVIDIA

TYPE: Videos

YEAR: 2024

As language models grow larger, compute infrastructure must deliver both reliability and performance at unprecedented scale. Beyond having a large number of GPUs working together, the platform needs to guarantee fabric and I/O performance and stability, and its software must be architected for consistency and reliability across workload launching, job scheduling, and monitoring. In this talk, we will describe how Eos was built to leverage an H100 reference cluster architecture.

