Systems and Networking
Leveraging Agents to Debug NCCL Watchdog Timeouts: A Technical, Repeatable Workflow
Phillip Liu, John Wu, and Uttam Thakore

NCCL (NVIDIA Collective Communications Library) watchdog timeouts are one of the most burdensome failure modes in large-scale distributed training. They show up intermittently and waste GPU hours, and the surface error is usually far away from the real bug. The debugging experience is also […]