As large language model (LLM) training scales across tens of thousands of GPUs, ensuring runtime reliability becomes both more challenging and more critical for maintaining efficiency. This talk explores how fine-grained observability can substantially enhance reliability in LLM training at scale. First, we discuss automated methods for detecting faulty machines by leveraging distinctive monitoring metric patterns, enabling rapid and accurate identification of problematic nodes while minimizing manual intervention. Second, we tackle reliability challenges within collective communication libraries (CCL), introducing a lightweight tracing and root cause analysis system that treats CCL as system software and reveals internal control and data dependencies. This approach allows for swift and precise detection of communication-related anomalies. Collectively, these advancements illustrate how fine-grained observability at both the machine and communication levels can significantly improve the robustness and operational efficiency of large-scale LLM training.
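
As a loose illustration of the first idea, detecting faulty machines from distinctive monitoring metric patterns, the sketch below flags nodes whose metric deviates from the fleet median by a robust z-score. This is a minimal peer-comparison example under assumed names and thresholds, not the speakers' actual detection system.

```python
import numpy as np

def flag_outlier_nodes(metrics, z_threshold=3.0):
    """Flag nodes whose metric deviates strongly from the fleet median.

    metrics: dict mapping node_id -> a metric value sampled at the same
             training step (e.g., per-step throughput or NIC send rate).
    Returns the set of node_ids whose robust z-score exceeds the threshold.
    """
    values = np.array(list(metrics.values()), dtype=float)
    median = np.median(values)
    # Median absolute deviation: robust even when a few nodes misbehave.
    mad = np.median(np.abs(values - median)) or 1e-9
    outliers = set()
    for node_id, v in metrics.items():
        z = 0.6745 * (v - median) / mad  # 0.6745 rescales MAD to ~sigma
        if abs(z) > z_threshold:
            outliers.add(node_id)
    return outliers

# Hypothetical example: one straggler with depressed throughput stands out.
fleet = {f"node{i}": 100.0 + np.random.randn() for i in range(64)}
fleet["node17"] = 42.0  # simulated faulty machine
print(flag_outlier_nodes(fleet))  # -> {'node17'}
```

In practice a production detector would compare many metrics jointly and account for expected heterogeneity across roles, but the same peer-comparison principle lets a faulty node be identified automatically without manual triage.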