Traffic Engineering for AI Training Networks

Meta has operated RoCE-based distributed training clusters serving internal AI training workloads since 2020. One major challenge that surfaced early on was inconsistent job performance across different job scheduling schemes and under network failures. We attributed this to the static routing scheme we employed, which prompted us to pursue multiple approaches to address it.

Centralized Traffic Engineering, which dynamically places traffic over all available paths in a load-balanced manner, is one of the most promising solutions we have adopted to address this challenge. In this talk, we will cover the design, development, evaluation, and operational experience of the centralized traffic engineering solution.
