Networking @Scale 2024

September 11, 2024 | Santa Clara Convention Center

Meta’s Networking team invites you to Networking@scale on September 11th. This year’s event is an in-person event hosted at the Santa Clara Convention center and will also be live streamed for virtual attendees. Registration is open now. This year’s event will continue to focus on the evolution of AI Networking.

As AI models continue to expand beyond the confines of the data center, the underlying network infrastructure must adapt to meet the demands of increasingly complex and computationally intensive workloads and models. We will explore how these trends are reshaping the backbone and edge networks, necessitating innovative approaches to network design and operations.

A cornerstone of this year’s event will be a deep dive into Meta’s co-design efforts for networks supporting state-of-the-art training models such as Llama3. Attendees will gain valuable insights into the engineering challenges and solutions required to scale back-end networks.

To address the growing complexity of network operations, we will examine a full-stack perspective towards debugging, encompassing the communications layer through to the hardware. By adopting a holistic approach across both our font-end and back-end networks we can identify and mitigate potential bottlenecks, ensuring optimal network performance. You will also hear from industry experts and leading researchers who are at the fore-front of building large scale networks. Attendees will benefit from the opportunity to learn about diverse approaches to solve common challenges and explore potential collaborations.

Join us for Networking@Scale at the mixer after the event to engage with fellow experts, share knowledge and contribute to advancing the field of AI Networking

RSVPS CLOSED

AGENDA SPEAKERS

EVENT AGENDA

Event times below are displayed in PT.

September 11

08:30 AM - 10:00 AM

Breakfast Available at Venue

10:00 AM - 10:05 AM

Opening Remarks

WATCH NOW

Speaker Omar Baldonado,Meta

10:05 AM - 10:25 AM

Evolving Meta's Edge Architecture

WATCH NOW

Meta operates a globally distributed Edge CDN and Edge Cloud infrastructure, responsible for providing L4 and L7 protocol termination and proxying which power core user/server communications for all major Meta use cases, ranging from news feed, media delivery, to messaging, VOIP, Cloud Gaming and MetaAI.

The recent growth of non-cacheable, unique AI generated content coupled with the real-time requirements for interactive experiences and the metaverse drives the evolution of our Edge.

We are re-architecting the Edge to be a decentralized, multi-tenanted compute platform providing optimal user experience for graphics rendering, hosting gaming applications and running AI inference while offering secondary benefits of saving backbone bandwidth.

Speaker Shivkumar Chandrashekhar,Meta

Speaker Lee Hetherington,Meta

10:25 AM - 10:45 AM

Future Challenges for HPC Networking @Scale

WATCH NOW

Over the last three decades, extreme-scale high-performance computing (HPC) systems have evolved from highly-specialized hardware running custom software environments to platforms that are almost entirely composed of commodity components. While some aspects of large-scale HPC systems continue to be enhanced to meet performance and scalability demands, HPC systems have been distinguished by interconnect technology. The emergence of cloud computing, hyperscalers, and the demands of AI/ML workloads has led to the deployment of massive data centers containing systems much larger than the fastest HPC systems. Until recently, these systems were easily differentiated from HPC machines by the use of commodity ethernet networks. However, these worlds are now converging in several important ways. This presentation will describe how interconnect hardware and software for HPC systems has been impacted by this convergence and offer a perspective on future challenges that will need to be addressed.

Speaker Ron Brightwell,Sandia National Labs

10:45 AM - 11:05 AM

AI Impact on Backbone

WATCH NOW

Demand for AI capacity from our family of apps and products has accelerated the growth of our large backbone network. Initially, we expected AI-driven traffic to mainly stay within data centers. However, high replication and data freshness requirements, co-location challenges, and cross-region inference needs have increased traffic by 30-50%. To manage this, we've deepened our understanding of the AI traffic lifecycle (from data collection to training / inference) and controlled backbone traffic growth through efficient workload placement, scheduled bulk transfers, and quality of service initiatives. We've also had to build larger buffers to future-proof our network. This talk shares our learnings from addressing the surge in AI traffic on our backbone network.

Speaker Jyotsna Sundaresan,Meta

Speaker Abishek Gopalan,Meta

11:05 AM - 11:35 AM

Live Q&A Session 1

WATCH NOW

Moderated by Ying Zhang

Moderator Ying Zhang,Meta

11:35 AM - 11:55 AM

Break

11:55 AM - 12:15 PM

Network Communication Debuggability and Observability at Scale

WATCH NOW

Training large models like Llama3 at scale introduces unprecedented challenges in terms of failures and performance degradation. Existing monitoring techniques fall short in providing the necessary insights for effective troubleshooting. This talk presents our experience in instrumenting the collective library to gain deeper visibility into network operations. We will share techniques for real-time performance tracing, grouping these operations into meaningful cohorts and analyzing their impact on model training. By co-relating model performance with network bottlenecks we can identify issues like slow ranks and also hard failures, ultimately improving training efficiency and reliability.

Speaker Ashmitha Jeevaraj Shetty,Meta

Speaker Min Si,Meta

12:15 PM - 12:35 PM

Alibaba HPN: A Data Center Network for Large Language Model Training

WATCH NOW

Due to the differences between LLMs and general cloud computing (e.g., in terms of traffic patterns and fault tolerance), traditional data center networks are not well-suited for LLM training. Unlike general cloud computing which generates millions of small flows (e.g., lower than 10Gbps), LLM training produces a small number of periodic, bursty flows (e.g., 400Gbps) on each host. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, typically accommodated by the traditional 3-tier Clos architecture. Such a new architecture design not only avoids hash polarization, but also greatly reduces the search space for path selection, thus allowing us to precisely select network paths capable of holding elephant flows. HPN also employs a dual-ToR design to avoid the single point of failure problem. We share our experience in motivating, designing, and building HPN, as well as the operational lessons of HPN in production.

Speaker Jiaqi Gao,Alibaba

12:35 PM - 12:55 PM

High Network Reliability and Availability in FE and BE for Scalable Training Solutions

WATCH NOW

Meta has focused on enhancing reliability in Backend (BE) and Frontend (FE) networks for AI training, ensuring low latency and high throughput for GPUs and stable data flow for checkpointing. We've implemented a dual monitoring strategy using SLI and evidence-based collections for improved network health analysis and faster issue detection. Stricter controls, on-box agents, and robust SLOs for repair times have been adopted to enhance monitoring and quicken issue resolution. These measures maintain optimal network performance, which is crucial for large-scale training, demonstrating our commitment to a robust and reliable network infrastructure for advanced AI training.

Speaker Jose Leitao,Meta

Speaker Robert Colantuoni,Meta

Featured Article

Enhancing Network Reliability for AI Training read more

12:55 PM - 01:25 PM

Live Q&A Session 2

WATCH NOW

Moderated by Joseph Provine

Moderator Joseph Provine,Meta

01:25 PM - 02:25 PM

Lunch Break

02:25 PM - 02:45 PM

Scheduler and Sharding Considerations for Network Efficiency

WATCH NOW

Every generation of Large Language models requires an order of magnitude large compute and networking infrastructure. Various model and data parallelism techniques distribute the computational complexity of the model onto infrastructure. Achieving optimal network communication performance requires careful consideration of how models map to network architecture and hierarchy. One of the important layers of the training stack that influences such mapping is the Job scheduler. This talk explores how we customized Meta’s Job Scheduler (MAST) to achieve optimal parallelism to network topology. We will cover how we represented our network hierarchy to be consumed by the scheduler as well as, the specific customizations we performed in order to ensure high bandwidth, latency sensitive communications map to the first few network hops. We will present a recent use case of Llama3 to demonstrate the parallelisms used during training and the impact scheduler modifications have had on performance and network efficiency of such parallelisms.

Speaker Weiwei Chu,Meta

Speaker Arnab Choudhury,Meta

Featured Article

Scheduling and Sharding Considerations for Network Efficiency read more

02:45 PM - 03:05 PM

Designing Scalable Networks for Large AI Clusters: Challenges and Key Insights

WATCH NOW

As AI continues to revolutionize industries, the demand for large-scale AI training clusters is rapidly increasing to meet the growing need for advanced computational capabilities. Fields such as autonomous driving, medical image analysis, language model training, financial modeling, and drug discovery require robust and scalable infrastructure to handle the complexity and scale of AI training workloads. Efficient and high performant network is a key component of distributed training, as it involves the coordination of multiple thousands of GPUs over extended periods. This creates significant challenges in routing and network reliability that must be addressed to ensure optimal performance and reduce job interruptions. To meet these demands, reevaluating data center network design and protocols is crucial for building reliable, high-performance infrastructure. In this talk, we explore the key challenges involved in designing and developing networks for large-scale AI training. By drawing on Microsoft’s experience with large-scale training cluster development, we present key insights and lessons learned, offering valuable guidance on creating scalable and efficient network infrastructures that can support complex AI workloads while maintaining robust routing and network reliability.

Speaker Jithin Jose,Microsoft

03:05 PM - 03:25 PM

Faster Than Fast: Networking and Communication Optimizations for Llama 3

WATCH NOW

Network and Collective Communication stack plays a pivotal role in extracting the best performance out of large GenAI Clusters. In this talk, we will go over in-depth Network and Communicational library tuning that helped achieve optimal performance for GenAI Models such as LLaMA3. We’ll touch on both optimizations, from training workload as well as model serving perspective. We’ll dig into how we mitigated the impact of network latency by implementing novel collective algorithms, network routing enhancements and steps taken to reduce the impact of compute-overlap on communication time. We’ll provide our perspective on challenges that remain in scaling these models to a larger scale, while still achieving optimal Compute and Network efficiency.

Speaker Pavan Balaji,Meta

Speaker Adi Gangidi,META

03:25 PM - 03:55 PM

Live Q&A Session 3

WATCH NOW

Moderated by Shashi Gandham

Moderator Shashi Gandham,META

03:55 PM - 04:00 PM

Closing Remarks

Speaker Rajiv Krishnamurthy,Meta

04:00 PM - 05:30 PM

Happy Hour

SPEAKERS AND MODERATORS

Omar is an Engineering Director at Meta. read more

Omar Baldonado

Meta

Shivkumar is an engineering manager at Meta Infrastructure. He has led engineering teams to... read more

Shivkumar Chandrashekhar

Meta

Lee is a Production Engineer at Meta. read more

Lee Hetherington

Meta

Ron Brightwell currently leads the Scalable System Software Department in the Center for Computing... read more

Ron Brightwell

Sandia National Labs

Jyotsna is a seasoned professional with over 10 years of experience in the TMT... read more

Jyotsna Sundaresan

Meta

Abishek Gopalan is a network modeling and optimization engineer at Meta Platforms since 2017.... read more

Abishek Gopalan

Meta

Ying Zhang is a Software Engineering Manager at Meta, where she leads the core... read more

Ying Zhang

Meta

Ashmitha Shetty is a network engineer at Meta. She is a part of the... read more

Ashmitha Jeevaraj Shetty

Meta

Min is a Research Scientist at Meta. read more

Min Si

Meta

Dr. Jiaqi Gao is a Staff Engineer at Alibaba Cloud Networking Team, where he... read more

Jiaqi Gao

Alibaba

Jose Leitao is a production network engineer in the Network organization at Meta. His... read more

Jose Leitao

Meta

Robert is a production engineering lead supporting Meta datacenters. His work focuses on increasing... read more

Robert Colantuoni

Meta

Joseph Provine supports NIC and AI Transport at Meta. read more

Joseph Provine

Meta

Weiwei is a Research Scientist at Meta. read more

Weiwei Chu

Meta

Arnab is a Software Engineer at Meta. read more

Arnab Choudhury

Meta

Dr. Jithin Jose serves as a Principal Software Engineering Manager at Microsoft, where he... read more

Jithin Jose

Microsoft

Pavan is a Research Scientist at Meta. read more

Pavan Balaji

Meta

Adi is a Hardware Systems Engineer at Meta. read more

Adi Gangidi

Shashi Gandham

Rajiv Krishnamurthy

Meta

LATEST NOTES

Networking @Scale

09/23/2024

Enhancing Network Reliability for AI Training

As we continue to push the boundaries of AI, the reliability of our networks becomes increasingly important. We will share...

Networking @Scale

09/11/2024

Scheduling and Sharding Considerations for Network Efficiency

As Meta continues to drive and rapidly evolve AI models for our products, including AGI, and for the community, how...

past EVENT November 20-21, 2024 | Video @Scale

Video @Scale 2024

Video @Scale 2024 is a technical conference designed for engineers that develop or manage large-scale video systems serving millions of people. The development of large-scale video systems includes complex, unprecedented engineering challenges. The @Scale community...

PAST EVENT March 20, 2024 @ 9am PT - 3pm PT | RTC @Scale

RTC @Scale 2024

RTC @Scale is for engineers who develop and manage large-scale real-time communication (RTC) systems serving millions of people. The operations of large-scale RTC systems have always involved complex engineering challenges which continue to attract attention...

Past EVENT May 22, 2024 | Data @Scale

Data @Scale 2024

Data @Scale is a technical conference for engineers who are interested in building, operating, and using data systems at scale. Companies across the industry use data and underlying infrastructure to build products with user empathy,...

Past EVENT June 12, 2024 | Systems @Scale

Systems @Scale 2024

Systems @Scale 2024 is a technical conference intended for engineers that build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduces complex, unprecedented engineering...

Past EVENT JULY 31, 2024 @ 2:30 PM PDT - 7:00 PM PDT - IN PERSON EVENT | AUGUST 7, 2024 @ 2:30 PM PDT - 5:30 PM PDT - VIRTUAL PROGRAM | AI Infra @Scale

AI Infra @Scale 2024

Meta’s Engineering and Infrastructure teams are excited to return for the second year in a row to host AI Infra @Scale on July 31. This year’s event is open to a limited number of in-person...

Past EVENT August 14, 2024 | Product @Scale

Product @Scale 2024

Product @Scale conferences are designed for technologists who work on solving complex product problems at scale. The @Scale community focuses on bringing forward people's experiences in creating innovative solutions to large-scale products serving millions or...

Past EVENT September 11, 2024 | Santa Clara Convention Center | Networking @Scale

Networking @Scale 2024

Past EVENT October 9, 2024 | Reliability @Scale

Reliability @Scale 2024

In the digital age, where systems operate at unprecedented scales, the importance of robust configuration management cannot be overstated. This year’s Reliability @Scale will focus on a central theme of "Move Safely", emphasizing the critical...

Past EVENT October 23, 2024 | Mobile @Scale

Mobile @Scale 2024

Mobile @Scale is a technical conference designed for the engineers, product managers, and engineering leaders building mobile experiences at significant scale (millions to billions of daily users). Mobile @Scale provides a rare opportunity to gather...