High-performance, reliable collective communication over the AI-Zone RDMA network is foundational for enabling and scaling Meta's AI training and inference workloads. Capturing top-down observability from workload to network for collective communication is necessary to attribute performance regressions and training failures to the backend network. For this purpose, we introduced two important tools: ROCET, and the PARAM benchmark together with the Chakra ecosystem. We built ROCET to associate jobs with RDMA network metrics and provide analysis on top. In addition, we built the PARAM benchmark to allow analyzing and tuning collective communication operations through workload traces, and have recently scaled it to the community with Chakra for co-designing efficient distributed ML systems. In this talk, we will go over their design and use cases.
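To make the idea of benchmarking collective communication concrete, the sketch below times an all-reduce with torch.distributed. This is not PARAM's actual interface or trace-replay tooling; the function name, sizes, and launch command are illustrative assumptions, showing only the kind of collective microbenchmark such tools build on.

```python
# Illustrative sketch (not PARAM's actual API): time an all-reduce
# collective with torch.distributed, the basic measurement underlying
# collective-communication benchmarking and tuning.
import os
import time

import torch
import torch.distributed as dist


def benchmark_all_reduce(num_elems: int = 1 << 20, iters: int = 50) -> float:
    """Return the average all-reduce latency in milliseconds (hypothetical helper)."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    rank = dist.get_rank()
    device = (torch.device(f"cuda:{rank % torch.cuda.device_count()}")
              if backend == "nccl" else torch.device("cpu"))

    tensor = torch.ones(num_elems, device=device)

    # Warm up so one-time setup cost is excluded from the measurement.
    for _ in range(5):
        dist.all_reduce(tensor)
    if backend == "nccl":
        torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    if backend == "nccl":
        torch.cuda.synchronize(device)
    elapsed_ms = (time.perf_counter() - start) * 1000 / iters

    dist.destroy_process_group()
    return elapsed_ms


if __name__ == "__main__":
    # Example launch: torchrun --nproc_per_node=2 bench_allreduce.py
    latency = benchmark_all_reduce()
    if int(os.environ.get("RANK", "0")) == 0:
        print(f"avg all-reduce latency: {latency:.3f} ms")
```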