Network Communication Debuggability and Observability at Scale

Training large models like Llama3 at scale introduces unprecedented challenges in failures and performance degradation. Existing monitoring techniques fall short of providing the insight needed for effective troubleshooting. This talk presents our experience instrumenting the collective communication library to gain deeper visibility into network operations. We will share techniques for real-time performance tracing, for grouping these operations into meaningful cohorts, and for analyzing their impact on model training. By correlating model performance with network bottlenecks, we can identify issues such as slow ranks as well as hard failures, ultimately improving training efficiency and reliability.
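As a rough illustration of the kind of tracing the talk describes, the sketch below times individual collective calls per rank and flags ranks whose typical duration is an outlier relative to the rest. This is a minimal, hypothetical example using only the standard library; the function names (`trace_collective`, `slow_ranks`) and the median-based threshold are assumptions, not the instrumentation actually used in the collective library.

```python
import time
from statistics import median

def trace_collective(fn, rank, records):
    """Run one (simulated) collective call and record its wall-clock
    duration under this rank's entry in `records`."""
    start = time.perf_counter()
    result = fn()
    records.setdefault(rank, []).append(time.perf_counter() - start)
    return result

def slow_ranks(records, threshold=2.0):
    """Flag ranks whose median call duration exceeds `threshold` times
    the median across all ranks (a simple slow-rank heuristic)."""
    per_rank = {r: median(durations) for r, durations in records.items()}
    cluster_median = median(per_rank.values())
    return sorted(r for r, m in per_rank.items() if m > threshold * cluster_median)

# Example with synthetic timings: rank 2 is far slower than its peers.
records = {0: [1.0, 1.1], 1: [1.0, 0.9], 2: [5.0, 5.2]}
print(slow_ranks(records))  # -> [2]
```

In a real deployment the timings would come from hooks inside the collective library rather than wrapper functions, and grouping into cohorts (e.g. by operation type or communicator) would happen before outlier detection.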
