EVENT AGENDA
Event times below are displayed in PT.
Meta’s Networking team invites you to Networking@Scale on September 11th. This year’s event is in person at the Santa Clara Convention Center and will also be livestreamed for virtual attendees. Registration is open now. This year’s event will continue to focus on the evolution of AI Networking.
As AI models continue to expand beyond the confines of the data center, the underlying network infrastructure must adapt to meet the demands of increasingly complex and computationally intensive workloads and models. We will explore how these trends are reshaping the backbone and edge networks, necessitating innovative approaches to network design and operations.
A cornerstone of this year’s event will be a deep dive into Meta’s co-design efforts for networks supporting state-of-the-art training models such as Llama3. Attendees will gain valuable insights into the engineering challenges and solutions required to scale back-end networks.
To address the growing complexity of network operations, we will examine a full-stack perspective on debugging, from the communications layer through to the hardware. By adopting a holistic approach across both our front-end and back-end networks, we can identify and mitigate potential bottlenecks, ensuring optimal network performance. You will also hear from industry experts and leading researchers who are at the forefront of building large-scale networks. Attendees will benefit from the opportunity to learn about diverse approaches to solving common challenges and to explore potential collaborations.
Join us at the Networking@Scale mixer after the event to engage with fellow experts, share knowledge, and contribute to advancing the field of AI Networking.
Meta operates a globally distributed Edge CDN and Edge Cloud infrastructure that provides L4 and L7 protocol termination and proxying, powering core user/server communications for all major Meta use cases, from news feed and media delivery to messaging, VoIP, Cloud Gaming, and MetaAI.
The recent growth of non-cacheable, unique AI-generated content, coupled with the real-time requirements of interactive experiences and the metaverse, is driving the evolution of our Edge.
We are re-architecting the Edge to be a decentralized, multi-tenanted compute platform providing optimal user experience for graphics rendering, hosting gaming applications and running AI inference while offering secondary benefits of saving backbone bandwidth.
Over the last three decades, extreme-scale high-performance computing (HPC) systems have evolved from highly specialized hardware running custom software environments to platforms that are almost entirely composed of commodity components. While some aspects of large-scale HPC systems continue to be enhanced to meet performance and scalability demands, HPC systems have been distinguished by interconnect technology. The emergence of cloud computing and hyperscalers, together with the demands of AI/ML workloads, has led to the deployment of massive data centers containing systems much larger than the fastest HPC systems. Until recently, these systems were easily differentiated from HPC machines by the use of commodity Ethernet networks. However, these worlds are now converging in several important ways. This presentation will describe how interconnect hardware and software for HPC systems have been impacted by this convergence and offer a perspective on future challenges that will need to be addressed.
Demand for AI capacity from our family of apps and products has accelerated the growth of our large backbone network. Initially, we expected AI-driven traffic to mainly stay within data centers. However, high replication and data freshness requirements, co-location challenges, and cross-region inference needs have increased traffic by 30-50%. To manage this, we've deepened our understanding of the AI traffic lifecycle (from data collection to training / inference) and controlled backbone traffic growth through efficient workload placement, scheduled bulk transfers, and quality of service initiatives. We've also had to build larger buffers to future-proof our network. This talk shares our learnings from addressing the surge in AI traffic on our backbone network.
Moderated by Ying Zhang
Training large models like Llama3 at scale introduces unprecedented challenges in terms of failures and performance degradation. Existing monitoring techniques fall short in providing the necessary insights for effective troubleshooting. This talk presents our experience in instrumenting the collective library to gain deeper visibility into network operations. We will share techniques for real-time performance tracing, grouping these operations into meaningful cohorts, and analyzing their impact on model training. By correlating model performance with network bottlenecks, we can identify issues such as slow ranks and hard failures, ultimately improving training efficiency and reliability.
Due to the differences between LLM training and general cloud computing (e.g., in terms of traffic patterns and fault tolerance), traditional data center networks are not well suited for LLM training. Unlike general cloud computing, which generates millions of small flows (e.g., lower than 10 Gbps), LLM training produces a small number of periodic, bursty flows (e.g., 400 Gbps) on each host. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, a scale typically accommodated by the traditional 3-tier Clos architecture. This new architecture design not only avoids hash polarization but also greatly reduces the search space for path selection, allowing us to precisely select network paths capable of holding elephant flows. HPN also employs a dual-ToR design to avoid the single point of failure problem. We share our experience in motivating, designing, and building HPN, as well as the operational lessons of HPN in production.
Meta has focused on enhancing reliability in Backend (BE) and Frontend (FE) networks for AI training, ensuring low latency and high throughput for GPUs and stable data flow for checkpointing. We've implemented a dual monitoring strategy using SLI and evidence-based collections for improved network health analysis and faster issue detection. Stricter controls, on-box agents, and robust SLOs for repair times have been adopted to enhance monitoring and quicken issue resolution. These measures maintain optimal network performance, which is crucial for large-scale training, demonstrating our commitment to a robust and reliable network infrastructure for advanced AI training.
Every generation of large language models requires an order of magnitude larger compute and networking infrastructure. Various model and data parallelism techniques distribute the computational complexity of the model onto infrastructure. Achieving optimal network communication performance requires careful consideration of how models map to network architecture and hierarchy. One of the important layers of the training stack that influences such mapping is the job scheduler. This talk explores how we customized Meta’s Job Scheduler (MAST) to achieve an optimal mapping of parallelism to network topology. We will cover how we represented our network hierarchy to be consumed by the scheduler, as well as the specific customizations we performed to ensure that high-bandwidth, latency-sensitive communications map to the first few network hops. We will present a recent use case of Llama3 to demonstrate the parallelisms used during training and the impact scheduler modifications have had on the performance and network efficiency of such parallelisms.
As AI continues to revolutionize industries, the demand for large-scale AI training clusters is rapidly increasing to meet the growing need for advanced computational capabilities. Fields such as autonomous driving, medical image analysis, language model training, financial modeling, and drug discovery require robust and scalable infrastructure to handle the complexity and scale of AI training workloads. An efficient, high-performance network is a key component of distributed training, which involves coordinating thousands of GPUs over extended periods. This creates significant challenges in routing and network reliability that must be addressed to ensure optimal performance and reduce job interruptions. To meet these demands, reevaluating data center network design and protocols is crucial for building reliable, high-performance infrastructure. In this talk, we explore the key challenges involved in designing and developing networks for large-scale AI training. By drawing on Microsoft’s experience with large-scale training cluster development, we present key insights and lessons learned, offering valuable guidance on creating scalable and efficient network infrastructures that can support complex AI workloads while maintaining robust routing and network reliability.
The network and collective communication stack plays a pivotal role in extracting the best performance out of large GenAI clusters. In this talk, we will go in depth on the network and communication library tuning that helped achieve optimal performance for GenAI models such as Llama3. We’ll touch on optimizations from both the training workload and the model serving perspectives. We’ll dig into how we mitigated the impact of network latency by implementing novel collective algorithms and network routing enhancements, and the steps we took to reduce the impact of compute overlap on communication time. We’ll provide our perspective on the challenges that remain in scaling these models further while still achieving optimal compute and network efficiency.
Moderated by Shashi Gandham
Omar is an Engineering Director at Meta.
Shivkumar is an engineering manager at Meta Infrastructure. He has led engineering teams to...
Lee is a Production Engineer at Meta.
Ron Brightwell currently leads the Scalable System Software Department in the Center for Computing...
Jyotsna is a seasoned professional with over 10 years of experience in the TMT...
Abishek Gopalan is a network modeling and optimization engineer at Meta Platforms since 2017....
Ying Zhang is a Software Engineering Manager at Meta, where she leads the core...
Ashmitha Shetty is a network engineer at Meta. She is a part of the...
Min is a Research Scientist at Meta.
Dr. Jiaqi Gao is a Staff Engineer at Alibaba Cloud Networking Team, where he...
Jose Leitao is a production network engineer in the Network organization at Meta. His...
Robert is a production engineering lead supporting Meta datacenters. His work focuses on increasing...
Joseph Provine supports NIC and AI Transport at Meta.
Weiwei is a Research Scientist at Meta.
Arnab is a Software Engineer at Meta.
Dr. Jithin Jose serves as a Principal Software Engineering Manager at Microsoft, where he...
Pavan is a Research Scientist at Meta.
Adi is a Hardware Systems Engineer at Meta.
Shashi is currently Director of Software Engineering and supports network.ai. His team is responsible...
Rajiv currently is a Software Engineering Director in the Network Infrastructure group at Meta....