EVENT AGENDA
Event times below are displayed in PT.
Meta operates a globally distributed Edge CDN and Edge Cloud infrastructure that provides L4 and L7 protocol termination and proxying. This infrastructure powers core user/server communication for all major Meta use cases, ranging from news feed and media delivery to messaging, VoIP, Cloud Gaming, and Meta AI.
The recent growth of non-cacheable, unique AI generated content coupled with the real-time requirements for interactive experiences and the metaverse drives the evolution of our Edge.
We are re-architecting the Edge into a decentralized, multi-tenant compute platform that provides an optimal user experience for graphics rendering, hosting gaming applications, and running AI inference, with the secondary benefit of saving backbone bandwidth.
Over the last three decades, extreme-scale high-performance computing (HPC) systems have evolved from highly specialized hardware running custom software environments to platforms that are almost entirely composed of commodity components. While some aspects of large-scale HPC systems continue to be enhanced to meet performance and scalability demands, HPC systems have been distinguished by their interconnect technology. The emergence of cloud computing, hyperscalers, and the demands of AI/ML workloads have led to the deployment of massive data centers containing systems much larger than the fastest HPC systems. Until recently, these systems were easily differentiated from HPC machines by their use of commodity Ethernet networks. However, these worlds are now converging in several important ways. This presentation will describe how interconnect hardware and software for HPC systems have been impacted by this convergence and offer a perspective on future challenges that will need to be addressed.
Demand for AI capacity from our family of apps and products has accelerated the growth of our large backbone network. Initially, we expected AI-driven traffic to mainly stay within data centers. However, high replication and data freshness requirements, co-location challenges, and cross-region inference needs have increased traffic by 30-50%. To manage this, we've deepened our understanding of the AI traffic lifecycle (from data collection to training / inference) and controlled backbone traffic growth through efficient workload placement, scheduled bulk transfers, and quality of service initiatives. We've also had to build larger buffers to future-proof our network. This talk shares our learnings from addressing the surge in AI traffic on our backbone network.
Moderated by Ying Zhang
Training large models like Llama3 at scale introduces unprecedented challenges in terms of failures and performance degradation. Existing monitoring techniques fall short in providing the insights needed for effective troubleshooting. This talk presents our experience in instrumenting the collective library to gain deeper visibility into network operations. We will share techniques for real-time performance tracing, grouping these operations into meaningful cohorts, and analyzing their impact on model training. By correlating model performance with network bottlenecks, we can identify issues such as slow ranks and hard failures, ultimately improving training efficiency and reliability.
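As a rough illustration of the cohort idea in the abstract above, the sketch below groups per-rank trace records by collective operation and flags ranks that consistently join late. It is not the instrumentation described in the talk: the record format (collective_id, rank, start_s), the function name find_slow_ranks, and the 10 ms threshold are all hypothetical, and the approach assumes that clocks across ranks are synchronized.

```python
from collections import defaultdict
from statistics import median


def find_slow_ranks(events, lateness_threshold_s=0.010):
    """Return ranks that typically join collectives late.

    events: iterable of (collective_id, rank, start_s) tuples.
    This trace format is an assumption made for illustration only.
    """
    # A cohort is the set of records that belong to one collective operation.
    cohorts = defaultdict(dict)
    for collective_id, rank, start_s in events:
        cohorts[collective_id][rank] = start_s

    # Lateness of a rank: how long after the earliest rank it joined.
    lateness = defaultdict(list)
    for starts in cohorts.values():
        earliest = min(starts.values())
        for rank, start_s in starts.items():
            lateness[rank].append(start_s - earliest)

    # Use the median so a single late arrival does not flag a rank.
    return sorted(
        rank
        for rank, delays in lateness.items()
        if median(delays) > lateness_threshold_s
    )


# Made-up trace: rank 2 joins every all-reduce roughly 50 ms late.
events = [
    ("allreduce-0", 0, 1.000), ("allreduce-0", 1, 1.001), ("allreduce-0", 2, 1.050),
    ("allreduce-1", 0, 2.000), ("allreduce-1", 1, 2.002), ("allreduce-1", 2, 2.045),
    ("allreduce-2", 0, 3.001), ("allreduce-2", 1, 3.000), ("allreduce-2", 2, 3.060),
]
slow = find_slow_ranks(events)
print(slow)
```

Start times are compared rather than durations because, in a synchronizing collective, the ranks that arrive on time are the ones that spend the longest waiting, so a long duration alone does not identify the straggler.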
Due to the differences between LLMs and general cloud computing (e.g., in terms of traffic patterns and fault tolerance), traditional data center networks are not well suited for LLM training. Unlike general cloud computing, which generates millions of small flows (e.g., below 10 Gbps), LLM training produces a small number of periodic, bursty flows (e.g., 400 Gbps) on each host. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, a scale typically accommodated by the traditional 3-tier Clos architecture. This new architecture not only avoids hash polarization but also greatly reduces the search space for path selection, allowing us to precisely select network paths capable of holding elephant flows. HPN also employs a dual-ToR design to avoid the single-point-of-failure problem. We share our experience in motivating, designing, and building HPN, as well as the operational lessons of running HPN in production.
Meta has focused on enhancing reliability in Backend (BE) and Frontend (FE) networks for AI training, ensuring low latency and high throughput for GPUs and stable data flow for checkpointing. We've implemented a dual monitoring strategy using SLI and evidence-based collections for improved network health analysis and faster issue detection. Stricter controls, on-box agents, and robust SLOs for repair times have been adopted to enhance monitoring and speed up issue resolution. These measures maintain optimal network performance, which is crucial for large-scale training, and demonstrate our commitment to a robust and reliable network infrastructure for advanced AI training.
Every generation of large language models requires an order of magnitude larger compute and networking infrastructure. Various model and data parallelism techniques distribute the computational complexity of the model onto the infrastructure. Achieving optimal network communication performance requires careful consideration of how models map to network architecture and hierarchy. One of the important layers of the training stack that influences such mapping is the job scheduler. This talk explores how we customized Meta’s Job Scheduler (MAST) to achieve an optimal mapping of parallelism to network topology. We will cover how we represented our network hierarchy to be consumed by the scheduler, as well as the specific customizations we performed to ensure that high-bandwidth, latency-sensitive communications map to the first few network hops. We will present a recent use case of Llama3 to demonstrate the parallelisms used during training and the impact scheduler modifications have had on the performance and network efficiency of such parallelisms.
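To make the idea of mapping communication-heavy groups onto the first few network hops concrete, here is a minimal sketch. It is not MAST's algorithm: the (host, pod, rack) tuple format and the helper names place_groups and racks_spanned are hypothetical, and the simple sort-then-chunk strategy only keeps groups rack-local when every rack holds a multiple of group_size hosts.

```python
def place_groups(hosts, group_size):
    """Chunk hosts into communication groups after ordering them by topology.

    hosts: list of (host, pod, rack) tuples, a made-up representation of
    the network hierarchy used only for this illustration.
    """
    # Hosts that share a pod and rack become neighbors in the ordering,
    # so consecutive chunks tend to stay under the same top-of-rack switch.
    ordered = sorted(hosts, key=lambda h: (h[1], h[2], h[0]))
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


def racks_spanned(group):
    """Count the distinct (pod, rack) locations that a group touches."""
    return len({(pod, rack) for _, pod, rack in group})


# Hosts listed in an arbitrary, topology-unaware order.
hosts = [
    ("h5", "pod1", "rackA"), ("h0", "pod0", "rackA"),
    ("h3", "pod0", "rackB"), ("h6", "pod1", "rackB"),
    ("h1", "pod0", "rackA"), ("h7", "pod1", "rackB"),
    ("h2", "pod0", "rackB"), ("h4", "pod1", "rackA"),
]

# Baseline: chunk in the given order, ignoring the hierarchy.
naive_groups = [hosts[i:i + 2] for i in range(0, len(hosts), 2)]
# Topology-aware: order by (pod, rack) first, then chunk.
aware_groups = place_groups(hosts, group_size=2)

print([racks_spanned(g) for g in naive_groups])
print([racks_spanned(g) for g in aware_groups])
```

In this toy layout, every topology-aware group stays within a single rack, while each naively chunked group spans two racks and must cross more network hops.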
As AI continues to revolutionize industries, the demand for large-scale AI training clusters is rapidly increasing to meet the growing need for advanced computational capabilities. Fields such as autonomous driving, medical image analysis, language model training, financial modeling, and drug discovery require robust and scalable infrastructure to handle the complexity and scale of AI training workloads. An efficient, high-performance network is a key component of distributed training, which involves the coordination of thousands of GPUs over extended periods. This creates significant challenges in routing and network reliability that must be addressed to ensure optimal performance and reduce job interruptions. To meet these demands, reevaluating data center network design and protocols is crucial for building reliable, high-performance infrastructure. In this talk, we explore the key challenges involved in designing and developing networks for large-scale AI training. Drawing on Microsoft’s experience with large-scale training cluster development, we present key insights and lessons learned, offering guidance on creating scalable and efficient network infrastructures that can support complex AI workloads while maintaining robust routing and network reliability.
The network and collective communication stack plays a pivotal role in extracting the best performance out of large GenAI clusters. In this talk, we will go in depth on the network and communication library tuning that helped achieve optimal performance for GenAI models such as LLaMA3. We’ll touch on optimizations from both the training workload and model serving perspectives. We’ll dig into how we mitigated the impact of network latency by implementing novel collective algorithms and network routing enhancements, and the steps taken to reduce the impact of compute overlap on communication time. We’ll provide our perspective on the challenges that remain in scaling these models further while still achieving optimal compute and network efficiency.
Moderated by Shashi Gandham