Meta’s Network Journey to Enable AI

Over the years, Meta’s AI infrastructure has undergone a remarkable transformation: from CPU-based training, to GPU-based training within a single host, and ultimately to distributed systems interconnected by a network. Today, our model training relies heavily on a RoCE-based network fabric with a Clos topology, in which leaf switches connect to GPU hosts and spine switches provide the scale-out connectivity across the cluster’s GPUs. This presentation will delve into the progressive evolution of our network builds, tailored to the demanding requirements of AI services. Attendees will gain insight into the challenges encountered, the solutions implemented, and the strategic considerations behind building an efficient, high-performance fabric for AI workloads at Meta.
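To make the leaf/spine arithmetic behind such a fabric concrete, here is a minimal sketch of the capacity math for a two-tier Clos topology. All parameters (switch radix, uplink counts) are illustrative assumptions, not figures from Meta's actual deployment:

```python
# Hypothetical two-tier Clos (leaf/spine) capacity sketch.
# Radix and uplink counts are illustrative, not Meta's real build.

def clos_capacity(radix: int, uplinks_per_leaf: int) -> dict:
    """Size a two-tier Clos fabric with identical leaf/spine switches.

    radix: total ports per switch
    uplinks_per_leaf: leaf ports reserved for spine uplinks
    """
    downlinks_per_leaf = radix - uplinks_per_leaf   # ports toward GPU hosts
    num_spines = uplinks_per_leaf                   # one uplink per spine plane
    num_leaves = radix                              # each spine port feeds one leaf
    gpus = num_leaves * downlinks_per_leaf          # one GPU NIC per downlink
    oversub = downlinks_per_leaf / uplinks_per_leaf # >1.0 means oversubscribed
    return {"leaves": num_leaves, "spines": num_spines,
            "gpus": gpus, "oversubscription": oversub}

# Example: 64-port switches with half the leaf ports as uplinks
# yields a 1:1 (non-blocking) fabric for 2048 GPUs.
print(clos_capacity(radix=64, uplinks_per_leaf=32))
```

Dedicating half of each leaf's ports to uplinks keeps the fabric non-blocking; reserving fewer uplinks trades bisection bandwidth for more attached GPUs, a common tension when scaling out AI training clusters.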
