Data, Systems and Networking
Network Communication Debuggability and Observability at Scale
Training large models like Llama 3 at scale introduces unprecedented challenges around failures and performance degradation. Existing monitoring techniques do not provide the insight needed for effective troubleshooting. This talk presents our experience instrumenting the collective library to gain deeper visibility into network operations. We will share techniques for real-time performance tracing, […]