SEPTEMBER 07, 2023

Network Observability for AI/HPC Training Workflows

High-performance, reliable collective communication over the AI-Zone RDMA network is foundational for enabling and scaling Meta's AI training and inference workloads. Capturing top-down observability for collective communication, from the workload down to the network, is necessary to attribute performance regressions and training failures to the backend network. For this purpose, we introduced two important tools: ROCET, and the PARAM benchmark together with the Chakra ecosystem. We built ROCET to associate jobs with RDMA network metrics and provide analysis on top of them. In addition, we built the PARAM benchmark to analyze and tune collective communication operations through workload traces, and have recently scaled it to the community through Chakra for co-designing efficient distributed ML systems. In this talk, we will go over their design and use cases.
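
To make the kind of measurement PARAM's collective-communication benchmarks perform more concrete, here is a minimal sketch, not PARAM's actual API, that times a single all_reduce with torch.distributed. The script name, message size, and torchrun launch command are illustrative assumptions.

```python
# Minimal sketch (not the PARAM benchmark itself): time an all_reduce with
# torch.distributed, the same kind of collective operation PARAM benchmarks.
# Launch with e.g.:  torchrun --nproc_per_node=2 allreduce_timing.py
import time

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT in the environment.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts
    rank = dist.get_rank()

    # 64 MiB of float32 payload; adjust to sweep message sizes.
    tensor = torch.ones(16 * 1024 * 1024, dtype=torch.float32)

    # Warm up once so connection setup is not counted in the timing.
    dist.all_reduce(tensor)
    dist.barrier()

    iters = 10
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    dist.barrier()
    elapsed = (time.perf_counter() - start) / iters

    if rank == 0:
        size_mb = tensor.numel() * tensor.element_size() / 2**20
        print(f"all_reduce of {size_mb:.0f} MiB: {elapsed * 1e3:.2f} ms/iter")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Sweeping the tensor size and collective type in a loop gives a simple latency/bandwidth profile per message size, which is the raw signal needed to attribute workload slowdowns to the backend network.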
