Network Communication Debuggability and Observability at Scale

Training large models like Llama3 at scale introduces unprecedented challenges in failures and performance degradation. Existing monitoring techniques fall short of providing the insight needed for effective troubleshooting. This talk presents our experience instrumenting the collective communication library to gain deeper visibility into network operations. We will share techniques for real-time performance tracing, for grouping these operations into meaningful cohorts, and for analyzing their impact on model training. By correlating model performance with network bottlenecks, we can identify issues such as slow ranks as well as hard failures, ultimately improving training efficiency and reliability.
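As a rough illustration of the kind of tracing the talk describes, the sketch below times individual collective calls per rank and flags ranks whose typical duration is an outlier relative to the rest. This is a minimal, hypothetical example using only the standard library; the function names (`trace_collective`, `slow_ranks`) and the median-based threshold are assumptions, not the instrumentation actually used in the collective library.

```python
import time
from statistics import median

def trace_collective(fn, rank, records):
    """Run one (simulated) collective call and record its wall-clock
    duration under this rank's entry in `records`."""
    start = time.perf_counter()
    result = fn()
    records.setdefault(rank, []).append(time.perf_counter() - start)
    return result

def slow_ranks(records, threshold=2.0):
    """Flag ranks whose median call duration exceeds `threshold` times
    the median across all ranks (a simple slow-rank heuristic)."""
    per_rank = {r: median(durations) for r, durations in records.items()}
    cluster_median = median(per_rank.values())
    return sorted(r for r, m in per_rank.items() if m > threshold * cluster_median)

# Example with synthetic timings: rank 2 is far slower than its peers.
records = {0: [1.0, 1.1], 1: [1.0, 0.9], 2: [5.0, 5.2]}
print(slow_ranks(records))  # -> [2]
```

In a real deployment the timings would come from hooks inside the collective library rather than wrapper functions, and grouping into cohorts (e.g. by operation type or communicator) would happen before outlier detection.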
