Scaling RoCE Networks for AI Training

In this talk we provide an overview of Meta’s RDMA deployment based on RoCEV2 transport for supporting our production AI Training infrastructure. We will shed light on how we designed our infrastructure to both maximize raw performance and consistency that is fundamental for the workload. We will talk about the challenges we solved in Routing, Transport and Hardware layers we solved along the way to scale our infrastructure. We will also touch on opportunities that remain in this space to make further progress over the next few years.

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy