Networking @Scale 2023

SEPTEMBER 07, 2023 @ 9:00 AM PDT - 12:00 PM PDT

Networking @Scale is a technical conference for engineers who build and manage large-scale networks. Meta’s Networking Infrastructure team is excited to host Networking @Scale: AI Networking Edition, a one-day virtual event featuring speakers from Meta who will share how Meta is designing, building, and operating the next-generation networking infrastructure that scales to support some of the largest AI workloads and technologies powering Meta’s products and services.

ABOUT EVENT

A scalable and performant networking infrastructure is the foundation for deploying applications and services that serve billions of users across the globe. Building and operating such large-scale networks often presents complex engineering challenges. The Networking @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The 2023 edition of Networking @Scale will focus on AI Networking. This one-day virtual conference will showcase how Meta is designing, operating, and innovating the next generation of network infrastructure that supports some of the largest AI infrastructure powering Meta’s products and services today. The event will feature six technical presentations on the evolution of networking technologies and solutions that address the requirements and challenges of modern-day AI workloads within Meta’s infrastructure.

EVENT AGENDA

Event times below are displayed in PT.

September 7

09:00 AM - 09:05 AM

Welcome Remarks

Speaker Tanuja Ingale,Meta

09:05 AM - 09:30 AM

Networking for GenAI training and inference clusters

Generative AI (GenAI) is rapidly evolving and has become one of Meta’s top priorities. GenAI introduces new challenges for infrastructure, and for the network in particular, because of the sheer scale and complexity of the models. We will discuss the unique challenges posed by large language models and compare them with recommendation models, which have been the primary AI workloads at Meta.

Speaker Jongsoo Park,Meta

Speaker Petr Lapukhov,Meta

09:30 AM - 09:50 AM

Meta’s Network Journey to Enable AI

Over the years, Meta's AI infrastructure has undergone a remarkable transformation, moving from CPU-based training to GPU-based training within a single host, and ultimately to distributed systems interconnected by a network. Today, our model training relies heavily on a RoCE-based network fabric with a Clos topology, where leaf switches connect to GPU hosts and spine switches provide scale-out connectivity to the GPUs in the cluster. This presentation will delve into the progressive evolution of our network builds, specifically tailored to support the demanding requirements of AI services. Attendees will gain insights into the challenges encountered, the solutions implemented, and the strategic considerations behind building an efficient, high-performance fabric for AI workloads at Meta.
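As a rough illustration of the leaf/spine arrangement described in this abstract, the sketch below builds a small two-tier Clos fabric and lists the equal-cost paths between two GPU hosts. It is not Meta's design: the function names (`build_clos`, `equal_cost_paths`), the naming of hosts and switches, and the sizes used are all assumptions made for the example.

```python
from itertools import product


def build_clos(num_leaves, num_spines, hosts_per_leaf):
    """Build a hypothetical two-tier leaf-spine fabric.

    Returns a host -> leaf mapping and the set of (leaf, spine) links.
    Every leaf connects to every spine, as in a full Clos fabric.
    """
    host_to_leaf = {
        f"host{leaf}-{h}": f"leaf{leaf}"
        for leaf in range(num_leaves)
        for h in range(hosts_per_leaf)
    }
    links = {
        (f"leaf{leaf}", f"spine{spine}")
        for leaf, spine in product(range(num_leaves), range(num_spines))
    }
    return host_to_leaf, links


def equal_cost_paths(src, dst, host_to_leaf, links):
    """List the shortest paths between two hosts in the fabric."""
    src_leaf, dst_leaf = host_to_leaf[src], host_to_leaf[dst]
    if src_leaf == dst_leaf:
        # Hosts under the same leaf never need to cross a spine.
        return [[src, src_leaf, dst]]
    src_spines = {spine for leaf, spine in links if leaf == src_leaf}
    dst_spines = {spine for leaf, spine in links if leaf == dst_leaf}
    # One path per spine that both leaves are connected to.
    return [
        [src, src_leaf, spine, dst_leaf, dst]
        for spine in sorted(src_spines & dst_spines)
    ]
```

In this toy model, calling `build_clos(4, 2, 8)` and then `equal_cost_paths("host0-0", "host1-0", ...)` returns one path per spine, which is why the number of spines determines how many alternative routes traffic between leaves can use.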

Speaker Hany Morsy,Meta

Speaker Susana Contrera,Meta

09:50 AM - 10:05 AM

Traffic Engineering for AI Training Networks

Meta has been operating RoCE-based distributed training clusters serving internal AI training workloads since 2020. One major challenge that surfaced in the early days was inconsistent job performance across different job scheduling schemes and under network failures. We attributed this to the static routing scheme we employed, which prompted us to pursue multiple approaches to address it.

Centralized Traffic Engineering, which dynamically places traffic over all available paths in a load-balanced manner, is one of the most promising solutions we have adopted to address this challenge. In this talk, we will go over the design, development, evaluation, and operational experience of the centralized traffic engineering solution.
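To make the contrast between static routing and centralized placement concrete, here is a minimal sketch. It is an assumption-laden toy, not the system presented in the talk: the static scheme is modeled as hashing a flow name onto a path, and the centralized scheme as a simple greedy controller that puts each flow on the currently least-loaded path. The function names, the flow representation (name, Gbps), and the path labels are all hypothetical.

```python
import zlib


def static_hash_placement(flows, paths):
    """Toy static scheme: pin each flow to a path by hashing its name.

    The hash ignores load, so several large flows can land on one path.
    """
    load = {path: 0.0 for path in paths}
    for name, gbps in flows:
        path = paths[zlib.crc32(name.encode()) % len(paths)]
        load[path] += gbps
    return load


def centralized_placement(flows, paths):
    """Toy centralized scheme: greedy least-loaded placement.

    A controller with a global view places the largest flows first,
    each on the path that currently carries the least traffic.
    """
    load = {path: 0.0 for path in paths}
    assignment = {}
    for name, gbps in sorted(flows, key=lambda flow: flow[1], reverse=True):
        path = min(paths, key=lambda p: load[p])
        load[path] += gbps
        assignment[name] = path
    return assignment, load
```

With four equal 100 Gbps flows and two paths, the greedy controller ends with 200 Gbps on each path, while the hash-based result depends entirely on how the flow names happen to hash. A production system would also have to handle failures, flow churn, and reprogramming switches, none of which this sketch attempts.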

Speaker Shuqiang Zhang,Meta

Speaker Jingyi Yang,Meta

10:05 AM - 10:30 AM

Live Q&A Session

Moderator James Zeng,Meta

10:30 AM - 10:40 AM

Break

10:40 AM - 11:00 AM

Scaling RoCE Networks for AI Training

In this talk, we provide an overview of Meta's RDMA deployment, based on RoCEv2 transport, that supports our production AI training infrastructure. We will shed light on how we designed our infrastructure to maximize both raw performance and the consistency that is fundamental for the workload. We will talk about the challenges we solved in the routing, transport, and hardware layers along the way to scale our infrastructure. We will also touch on opportunities that remain in this space to make further progress over the next few years.

Speaker Adi Gangidi,Meta

11:00 AM - 11:20 AM

Network Observability for AI/HPC Training Workflows

High-performance, reliable collective communication over the AI-Zone RDMA network is foundational for enabling and scaling Meta's AI training and inference workloads. It is necessary to capture top-down observability from workload to network for collective communication, so that performance regressions and training failures can be attributed to the backend network. For this purpose, we introduced two important tools: ROCET, and the PARAM benchmark with the Chakra ecosystem. We built ROCET to associate jobs with RDMA network metrics and provide analysis on top. In addition, we built the PARAM benchmark to allow analyzing and tuning collective communication operations through workload traces, and recently scaled it to the community with Chakra for co-designing efficient distributed ML systems. In this talk, we will go over their design and use cases.

Speaker Shengbao Zheng,Meta

11:20 AM - 11:40 AM

Arcadia: End-to-end AI System Performance Simulator

This presentation will introduce Arcadia, a unified system designed to simulate the compute, memory, and network performance of AI training clusters. By providing a multi-disciplinary performance analysis framework, Arcadia aims to facilitate design and optimization at various system levels, including application, network, and hardware. This comprehensive system enables researchers and practitioners to gain valuable insights into the performance of future AI models and workloads on specific infrastructures, fostering data-driven decision-making and promoting the future evolution of models and hardware. Arcadia also provides the ability to simulate the performance impact of scheduled operational tasks on AI models running in production, helping engineers make job-aware decisions during day-to-day operational activity. Attendees will learn about the capabilities and potential impact of Arcadia in advancing the field of AI systems and infrastructure.

Speaker Zhaodong Wang,Meta

Speaker Satyajeet Singh Ahuja,Meta

11:40 AM - 12:00 PM

Live Q&A Session

Moderator Joseph Provine,Meta

SPEAKERS AND MODERATORS

Tanuja Ingale is a Technical Program Manager in the Production Network Infrastructure group at...

Tanuja Ingale

Meta

Jongsoo is a research scientist at Meta, AI Systems Co-design team, optimizing SW for...

Jongsoo Park

Meta

Petr Lapukhov is a Network Engineer who spent nearly ten years at Meta and,...

Petr Lapukhov

Hany Morsy

Meta

Susana Contrera is an Infrastructure Network Engineer at Meta. Her team is a key...

Susana Contrera

Meta

Shuqiang Zhang is a Software Engineer at Meta. He is currently working on performance...

Shuqiang Zhang

Meta

Jingyi Yang is a software engineer on the network.ai team at Meta where she...

Jingyi Yang

Meta

James Zeng currently leads AI Networking Software team at Meta. Since joining Meta in...

James Zeng

Meta

Adi is a Hardware Systems Engineer at Meta.

Adi Gangidi

Meta

Shengbao is a Research Scientist at Meta. He is part of the AI Networking...

Shengbao Zheng

Meta

Zhaodong Wang is a research scientist and Tech lead at Meta network infra team....

Zhaodong Wang

Meta

I am a Network Modeling and Optimization Engineer at Meta. Before Meta, I was...

Satyajeet Singh Ahuja

Meta

Joseph Provine supports NIC and AI Transport at Meta.

Joseph Provine

Meta
