In just two years, Meta has undergone a monumental transformation in its AI infrastructure, growing from a single research cluster to a sprawling network of nearly a hundred AI superclusters of varying sizes with hundreds of thousands of GPUs. This rapid expansion has introduced myriad challenges, from managing diverse hardware configurations to optimizing resource allocation. As part of this effort, we scaled our infrastructure to perform maintenance safely and predictably without disrupting training jobs. For example, our teams worked with NVIDIA to develop deep health checks for the GPU stack, allowing us to roll out critical upgrades continuously.
During this talk, we will share insights into the key areas that demanded our attention and the solutions we implemented to address them. From rolling updates for kernel drivers and firmware to redundancy and failover mechanisms, we will explore the technical intricacies of sustaining our AI infrastructure while carrying out essential maintenance. We will also discuss the role of automation and orchestration in streamlining maintenance operations, minimizing downtime, and optimizing resource utilization.
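As a rough illustration of the kind of workflow discussed in the talk, the sketch below shows a simplified rolling-maintenance loop: drain a small batch of hosts, apply an update, gate re-admission to the training pool on deep health checks, and quarantine anything that fails. This is a minimal sketch, not Meta's actual tooling; every function and host name here (drain_host, apply_firmware_update, run_deep_health_checks, return_to_pool) is a hypothetical placeholder.

```python
"""Illustrative sketch of a rolling-maintenance loop with health-check gating.

All names are hypothetical placeholders; real fleet tooling would handle job
checkpointing, scheduling, quotas, and repair workflows.
"""

from dataclasses import dataclass


@dataclass
class Host:
    name: str
    healthy: bool = True  # stand-in for real health-check results


def drain_host(host: Host) -> None:
    # Placeholder: checkpoint/migrate training jobs off the host first.
    print(f"draining {host.name}")


def apply_firmware_update(host: Host) -> None:
    # Placeholder for the kernel-driver/firmware rollout step.
    print(f"updating {host.name}")


def run_deep_health_checks(host: Host) -> bool:
    # Placeholder for GPU-stack checks (memory, interconnect, thermals, ...).
    print(f"health-checking {host.name}")
    return host.healthy


def return_to_pool(host: Host) -> None:
    # Placeholder: hand the host back to the training scheduler.
    print(f"returning {host.name} to the training pool")


def rolling_maintenance(fleet: list[Host], max_concurrent_drains: int = 1) -> None:
    """Update hosts a few at a time so training capacity is never disrupted."""
    for i in range(0, len(fleet), max_concurrent_drains):
        batch = fleet[i : i + max_concurrent_drains]
        for host in batch:
            drain_host(host)
            apply_firmware_update(host)
            if run_deep_health_checks(host):
                return_to_pool(host)
            else:
                # Keep unhealthy hosts out of the pool rather than handing a
                # bad GPU back to a running training job.
                print(f"quarantining {host.name} for repair")


if __name__ == "__main__":
    rolling_maintenance([Host("gpu-host-001"), Host("gpu-host-002", healthy=False)])
```

The key design point the sketch tries to capture is that re-admission is conditional: an updated host only rejoins the pool after passing deep health checks, which is what makes continuous rollout safe for long-running training jobs.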