September 08, 2023

Scaling RoCE Networks for AI Training

Adi Gangidi

TYPE: Videos

YEAR: 2023

In this talk, we provide an overview of Meta’s RDMA deployment, based on the RoCEv2 transport, that supports our production AI training infrastructure. We shed light on how we designed the infrastructure to maximize both raw performance and the consistency that is fundamental to the workload. We discuss the challenges we solved along the way in the routing, transport, and hardware layers to scale the infrastructure. We also touch on the opportunities that remain in this space for further progress over the next few years.
