News & Ideas | At Scale Conferences

Phillip Liu, John Wu, and Uttam Thakore

NCCL (NVIDIA Collective Communications Library) watchdog timeouts are one of the most burdensome failure modes in large-scale distributed training. They show up intermittently and waste GPU hours, and the surface error is usually far away from the real bug. The debugging experience is also […]

By: Phillip Liu

Systems and Networking

Leveraging Agents to Debug NCCL Watchdog Timeouts: A Technical, Repeatable Workflow

Gaurav Mitra

In December 2024, a configuration change at Meta caused a widespread routing error across every region. Within minutes, BGP (Border Gateway Protocol) processes were crashing across multiple regions. The rollback could not restore a clean state because the routing was already inconsistent. Persistent packet loss spread across every product. We declared a SEV […]

By: Gaurav Mitra

Systems and Networking

Teaching AI to Fight Fires: Building the Reliability Flywheel at Meta

Phil Lopreiato, Rahul Iyengar, Richard Ross, Jonathan Kaldor, and Gautam Sewani

As Meta’s infrastructure has grown into hyperscale, we need to be prepared for increasingly complicated failure scenarios. Some of them are software bugs, others are hardware issues, and even more are external events such as power or fiber outages. Increasingly complex failures also mean […]

By: Phil Lopreiato ...

Systems and Networking

(Almost) Fail & Tell: Stop the World

Komal Mangtani, Can Lin, Projjal Ghosh, and Iuliu Rus

A year ago the question about enterprise AI was, “Can we build agents?” Now the question has flipped, and we are asking, “Can we govern them and safely feed them at scale?” Today 80% of organizations have AI deployments; however, only […]

By: Komal Mangtani ...

Systems and Networking

LATEST ON @SCALE

Leveraging Agents to Debug NCCL Watchdog Timeouts: A Technical, Repeatable Workflow

Teaching AI to Fight Fires: Building the Reliability Flywheel at Meta

(Almost) Fail & Tell: Stop the World

Data Governance in the World of Agents