TOPIC: Data, Systems and Networking

Networking @Scale 2023

SEPTEMBER 07, 2023 @ 9:00 AM PDT - 12:00 PM PDT

Networking @Scale is a technical conference for engineers who build and manage large-scale networks. Meta’s Networking Infrastructure team is excited to host Networking @Scale: AI Networking Edition, a one-day virtual event featuring a range of speakers from Meta who will share how Meta is designing, building and operating the next-generation networking infrastructure that scales and supports some of the largest AI workloads and technologies powering Meta’s products and services.

ABOUT EVENT

A scalable and performant networking infrastructure is the foundation for deploying applications and services that serve billions of users across the globe. Building and operating such large-scale networks often presents complex engineering challenges. The Networking @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The 2023 edition of Networking @Scale will focus on AI Networking. This one-day virtual conference will showcase how Meta is designing, operating and innovating the next generation of network infrastructure, which supports some of the largest AI infrastructure powering Meta’s products and services today. The event will feature six technical presentations on the evolution of networking technologies and solutions that address the requirements and challenges of modern AI workloads within Meta’s infrastructure.

EVENT AGENDA

Event times below are displayed in PT.

September 7

09:00 AM - 09:05 AM
Welcome Remarks
Speaker Rajiv Krishnamurthy, Meta
Speaker Tanuja Ingale, Meta
09:05 AM - 09:30 AM
Networking for GenAI training and inference clusters

Generative AI (GenAI) is rapidly evolving and has become one of Meta’s top priorities. GenAI introduces new infrastructure challenges, particularly for the network, because of the sheer scale and complexity of the models. We will discuss the unique challenges posed by large language models in particular, compared with the recommendation models that have been the primary AI workloads at Meta.

Speaker Jongsoo Park, Meta
Speaker Petr Lapukhov, Meta
09:30 AM - 09:50 AM
Meta’s Network Journey to Enable AI

Over the years, Meta's AI infrastructure has undergone a remarkable transformation, transitioning from CPU-based training to GPU-based training within the same host, and ultimately adopting distributed systems interconnected by a network. Today, our model training relies heavily on a RoCE-based network fabric with a Clos topology, where leaf switches are connected to GPU hosts and spine switches provide the scale-out connectivity to GPUs in the cluster. This presentation will delve into the progressive evolution of our network builds, specifically tailored to support the demanding requirements of AI services. Attendees will gain insights into the challenges encountered, the innovative solutions implemented, and the strategic considerations behind building an efficient and high-performance fabric for AI workloads at Meta.

Speaker Hany Morsy, Meta
Speaker Susana Contrera, Meta
09:50 AM - 10:05 AM
Traffic Engineering for AI Training Networks

Meta has been operating RoCE-based distributed training clusters serving internal AI training workloads since 2020. One major challenge that surfaced in the early days was inconsistent job performance across different job scheduling schemes and network failures. We attributed this to the static routing scheme we employed, which prompted us to pursue multiple approaches to address it.

Centralized Traffic Engineering, which dynamically places traffic over all available paths in a load-balanced manner, is one of the most promising solutions we have adopted to address this challenge. In this talk, we will go over the design, development, evaluation, and operational experience of the centralized traffic engineering solution.

Speaker Shuqiang Zhang, Meta
Speaker Jingyi Yang, Meta
10:05 AM - 10:30 AM
Live Q&A Session
Moderator James Zeng, Meta
10:30 AM - 10:40 AM
Break
10:40 AM - 11:00 AM
Scaling RoCE Networks for AI Training

In this talk, we provide an overview of Meta's RDMA deployment based on the RoCEv2 transport, which supports our production AI training infrastructure. We will shed light on how we designed our infrastructure to maximize both the raw performance and the consistency that are fundamental to the workload. We will talk about the challenges we solved along the way in the routing, transport and hardware layers to scale our infrastructure. We will also touch on opportunities that remain in this space to make further progress over the next few years.

Speaker Adi Gangidi, Meta
11:00 AM - 11:20 AM
Network Observability for AI/HPC Training Workflows

High-performance and reliable collective communication over the AI-Zone RDMA network is foundational for enabling and scaling Meta's AI training and inference workloads. It is necessary to capture top-down observability from workload to network for collective communication, and thereby attribute performance regressions and training failures to the backend network. For this purpose, we introduced two important tools: ROCET, and the PARAM benchmark with the Chakra ecosystem. We built ROCET to associate jobs with RDMA network metrics and provide analysis on top. In addition, we built the PARAM benchmark to allow analyzing and tuning collective communication operations through workload traces, and recently scaled this work to the community with Chakra for co-designing efficient distributed ML systems. In this talk, we will go over their design and use cases.

Speaker Shengbao Zheng, Meta
11:20 AM - 11:40 AM
Arcadia: End-to-end AI System Performance Simulator

This presentation will introduce Arcadia, a unified system designed to simulate the compute, memory, and network performance of AI training clusters. By providing a multi-disciplinary performance analysis framework, Arcadia aims to facilitate design and optimization at various system levels, including application, network, and hardware. This comprehensive system enables researchers and practitioners to gain valuable insights into the performance of future AI models and workloads on specific infrastructures, fostering data-driven decision-making and promoting the future evolution of models and hardware. Arcadia can also simulate the performance impact of scheduled operational tasks on AI models that are running in production, which helps engineers make job-aware decisions during day-to-day operational activity. Attendees will learn about the capabilities and potential impact of Arcadia in advancing the field of AI systems and infrastructure.

Speaker Zhaodong Wang, Meta
Speaker Satyajeet Singh Ahuja, Meta
11:40 AM - 12:00 PM
Live Q&A Session
Moderator Joseph Provine, Meta

SPEAKERS AND MODERATORS

Rajiv Krishnamurthy, Meta
Rajiv is a Software Engineering Director in the Network Infrastructure group at Meta. He...

Tanuja Ingale, Meta
Tanuja Ingale is a Technical Program Manager in the Production Network Infrastructure group at...

Jongsoo Park, Meta
Jongsoo is a research scientist at Meta, AI Systems Co-design team, optimizing SW for...

Petr Lapukhov, Meta
Petr Lapukhov is a Network Engineer who spent nearly ten years at Meta and,...

Hany Morsy, Meta
Hany Morsy is a highly skilled Network Engineer with over 25 years of experience...

Susana Contrera, Meta
Susana Contrera is an Infrastructure Network Engineer at Meta. Her team is a key...

Shuqiang Zhang, Meta
Shuqiang Zhang is a Software Engineer at Meta. He is currently working on performance...

Jingyi Yang, Meta
Jingyi Yang is a software engineer on the network.ai team at Meta where she...

James Zeng, Meta
James Zeng currently leads AI Networking Software team at Meta. Since joining Meta in...

Adi Gangidi, Meta
At Meta, I lead RDMA Network design and deployments for AI workloads. Before this,...

Shengbao Zheng, Meta
Shengbao is a Research Scientist at Meta. He is part of the AI Networking...

Zhaodong Wang, Meta
Zhaodong Wang is a research scientist and Tech lead at Meta network infra team....

Satyajeet Singh Ahuja, Meta
I am a Network Modeling and Optimization Engineer at Meta. Before Meta, I was...

Joseph Provine, Meta
Joseph Provine supports NIC and AI Transport teams at Meta. Prior to Meta he...