Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability

As large language model (LLM) training scales to tens of thousands of GPUs, runtime reliability becomes both harder to ensure and more critical to training efficiency. This talk explores how fine-grained observability can substantially enhance reliability in LLM training at scale. First, we discuss automated methods for detecting faulty machines by leveraging distinctive patterns in monitoring metrics, enabling rapid, accurate identification of problematic nodes with minimal manual intervention. Second, we tackle reliability challenges within collective communication libraries (CCLs), introducing a lightweight tracing and root-cause analysis system that treats the CCL as system software and exposes its internal control and data dependencies, enabling swift and precise detection of communication-related anomalies. Together, these advances illustrate how fine-grained observability at both the machine and communication levels can significantly improve the robustness and operational efficiency of large-scale LLM training.
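
The abstract leaves the detection method unspecified; below is a minimal Python sketch of the general idea, not the system presented in the talk. It assumes each node reports a snapshot of monitoring metrics (the metric names sm_util and ecc_errors, the robust z-score rule, and the 3.5 threshold are all illustrative assumptions) and flags nodes whose metrics deviate sharply from their peers:

```python
# Hypothetical sketch: flag faulty nodes whose monitoring metrics deviate
# from their peers. Metric names, the threshold, and the robust z-score
# rule are illustrative assumptions, not the system described in the talk.
from statistics import median


def robust_zscores(values):
    """Median/MAD-based z-scores; robust to a few faulty outliers."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # avoid divide-by-zero
    return [0.6745 * (v - med) / mad for v in values]


def flag_faulty_nodes(metrics_by_node, threshold=3.5):
    """metrics_by_node: {node: {metric_name: latest_value}}.

    A node is flagged when any metric's robust z-score across all peers
    exceeds the threshold, mimicking the 'distinctive metric patterns'
    that separate faulty machines from healthy ones.
    """
    nodes = sorted(metrics_by_node)
    metric_names = metrics_by_node[nodes[0]]
    flagged = {}
    for name in metric_names:
        scores = robust_zscores([metrics_by_node[n][name] for n in nodes])
        for node, z in zip(nodes, scores):
            if abs(z) > threshold:
                flagged.setdefault(node, []).append((name, round(z, 2)))
    return flagged


# Example: one node reports abnormally low SM utilization and high ECC errors.
metrics = {f"node{i:02d}": {"sm_util": 0.92, "ecc_errors": 0} for i in range(8)}
metrics["node05"] = {"sm_util": 0.11, "ecc_errors": 412}
print(flag_faulty_nodes(metrics))  # -> {'node05': [('sm_util', ...), ('ecc_errors', ...)]}
```

In practice the comparison would run over time series rather than single snapshots, but the peer-comparison idea is the same: a faulty machine stands out against the statistically homogeneous behavior of its cohort.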

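For the CCL tracing system, the following sketch illustrates one way per-rank trace events exposing control and data dependencies could drive root-cause analysis of a hung collective. The TraceEvent fields and the straggler heuristic (report the first collective that some rank never entered, or that entered ranks never completed) are assumptions for illustration, not the talk's actual design:

```python
# Hypothetical sketch of communication root-cause analysis: given
# lightweight per-rank trace events for each collective call, find the
# collective where ranks diverge and the straggler blocking its peers.
# Event fields and the straggler heuristic are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class TraceEvent:
    rank: int
    coll_seq: int        # per-communicator sequence number of the collective
    op: str              # e.g. "all_reduce"
    entered_at: float    # timestamp when the rank entered the collective
    completed: bool      # whether the rank's kernel finished


def diagnose_hang(events, world_size):
    """Group events by collective sequence number, then report the first
    collective that not every rank entered (a control-dependency break:
    missing ranks never issued the call) or that entered ranks never
    completed (a data-dependency stall, e.g. a network or GPU fault).
    """
    by_seq = {}
    for e in events:
        by_seq.setdefault(e.coll_seq, []).append(e)
    for seq in sorted(by_seq):
        evs = by_seq[seq]
        missing = set(range(world_size)) - {e.rank for e in evs}
        if missing:
            return f"seq {seq} ({evs[0].op}): ranks {sorted(missing)} never entered"
        stalled = [e.rank for e in evs if not e.completed]
        if stalled:
            last = max(evs, key=lambda e: e.entered_at)
            return (f"seq {seq} ({evs[0].op}): ranks {sorted(stalled)} stalled; "
                    f"rank {last.rank} entered last (suspected straggler)")
    return "no anomaly found"


# Example: rank 3 never issues the second all-reduce, hanging everyone else.
events = [TraceEvent(r, 0, "all_reduce", 1.00 + r * 1e-3, True) for r in range(4)]
events += [TraceEvent(r, 1, "all_reduce", 2.00 + r * 1e-3, False) for r in range(3)]
print(diagnose_hang(events, world_size=4))
# -> seq 1 (all_reduce): ranks [3] never entered
```

Because collectives must be issued in the same order on every rank, even this coarse per-rank view is often enough to localize a hang to a single machine, which is what makes lightweight tracing attractive at scale.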