JUNE 12, 2024

BUILDING AT SCALE WITH H100: EOS AS A DGX SUPERPOD REFERENCE MODEL FOR LARGE DATA CENTER BUILDS

As language models grow larger, compute infrastructure must deliver both reliability and performance at unprecedented scale. Beyond having a large number of GPUs working together, the platform needs to guarantee fabric and I/O performance and stability, and its software must be architected for consistency and reliability across workload launch, job scheduling, and monitoring. In this talk, we will describe how Eos was built to leverage an H100 reference cluster architecture.

