Event times below are displayed in PT.
Networking @Scale is a technical conference for engineers who build and manage large-scale networks. Meta’s Networking Infrastructure team is excited to host Networking @Scale: AI Networking Edition, a one-day virtual event featuring a range of speakers from Meta who will share how Meta is designing, building, and operating the next generation of networking infrastructure to scale and support some of the largest AI workloads and technologies powering Meta’s products and services.
Register today and check back for upcoming speaker and agenda announcements!
A scalable and performant networking infrastructure is the foundation for deploying applications and services that serve billions of users across the globe. Building and operating such large-scale networks presents complex engineering challenges. The Networking @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.
The 2023 edition of Networking @Scale will focus on AI networking. This one-day virtual conference will showcase how Meta is designing, operating, and innovating the next generation of network infrastructure supporting some of the largest AI systems that power Meta’s products and services today. The event will feature six technical presentations on the evolution of networking technologies and solutions that address the requirements and challenges of modern AI workloads within Meta’s infrastructure.
Generative AI (GenAI) is rapidly evolving and has become one of Meta’s top priorities. GenAI introduces new challenges to the infrastructure, particularly for the network, due to the sheer scale and complexity of the models. We will discuss the unique challenges posed by large language models, comparing them with the recommendation models that have been the primary AI workloads at Meta.
Over the years, Meta's AI infrastructure has undergone a remarkable transformation, transitioning from CPU-based training to GPU-based training within the same host, and ultimately adopting distributed systems interconnected by a network. Today, our model training heavily relies on a RoCE-based network fabric with a Clos topology, where leaf switches are connected to GPU hosts and spine switches provide the scale-out connectivity to GPUs in the cluster. This presentation will delve into the progressive evolution of our network builds, specifically tailored to support the demanding requirements of AI services. Attendees will gain insights into the challenges encountered, innovative solutions implemented, and the strategic considerations behind building an efficient and high-performance fabric for AI workloads at Meta.
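To make the leaf/spine structure concrete, here is a back-of-the-envelope sizing sketch for a generic non-blocking two-tier Clos fabric. The switch radix is a hypothetical example, not Meta's actual hardware:

```python
# Illustrative sizing of a generic two-tier leaf/spine Clos fabric.
# Assumption: non-blocking design, so each leaf splits its ports
# evenly between GPU hosts (down) and spines (up).

def clos_capacity(radix: int) -> dict:
    """Maximum capacity of a non-blocking 2-tier Clos built from
    identical switches with `radix` ports each."""
    down_ports = radix // 2          # leaf ports facing GPU hosts
    up_ports = radix - down_ports    # leaf ports facing spines
    max_leaves = radix               # each spine can reach `radix` leaves
    return {
        "gpus_per_leaf": down_ports,
        "max_leaves": max_leaves,
        "max_gpus": down_ports * max_leaves,
        "spines_needed": up_ports,   # one spine per leaf uplink
    }

# e.g. hypothetical 64-port switches -> up to 64 * 32 = 2048 GPU ports
print(clos_capacity(64))
```

Scaling beyond this bound is what motivates additional fabric tiers or larger-radix switches.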
Meta has been operating RoCE-based distributed training clusters serving internal AI training workloads since 2020. One major challenge that surfaced in the early days was inconsistent job performance across different job-scheduling schemes and network failures. We attributed this to the static routing scheme we employed, which prompted us to pursue multiple approaches to address it.
Centralized Traffic Engineering, which dynamically places traffic over all available paths in a load balanced manner, is one of the most promising solutions we have adopted to address the challenge. In this talk, we will go over the design, development, evaluation, and operational experience of the centralized traffic engineering solution.
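The core idea can be sketched as follows. This is a minimal greedy placement under assumed inputs, not the actual traffic engineering system described in the talk: a controller with a global view assigns each flow to the currently least-loaded path instead of relying on a static per-flow hash:

```python
# Minimal sketch of centralized, load-balanced traffic placement:
# a global controller greedily assigns flows (largest first) to the
# least-loaded available path.
import heapq

def te_place(flow_sizes, num_paths):
    """Return {flow_index: path_id} placing flows on least-loaded paths."""
    heap = [(0.0, p) for p in range(num_paths)]  # (current load, path id)
    heapq.heapify(heap)
    placement = {}
    for i, size in sorted(enumerate(flow_sizes), key=lambda x: -x[1]):
        load, path = heapq.heappop(heap)         # least-loaded path
        placement[i] = path
        heapq.heappush(heap, (load + size, path))
    return placement

flows = [400, 100, 100, 100, 100]  # hypothetical flow demands in Gbps
print(te_place(flows, num_paths=2))
```

With these example demands, the large flow gets a path to itself while the four small flows share the other, leaving both paths equally loaded at 400 Gbps.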
In this talk we provide an overview of Meta's RDMA deployment, based on RoCEv2 transport, for supporting our production AI training infrastructure. We will shed light on how we designed our infrastructure to maximize both the raw performance and the consistency that are fundamental for the workload. We will discuss the challenges we solved in the routing, transport, and hardware layers along the way to scale our infrastructure, and also touch on the opportunities that remain in this space to make further progress over the next few years.
High-performance and reliable collective communication over the AI-zone RDMA network is foundational for enabling and scaling Meta's AI training and inference workloads. Capturing top-down observability, from workload to network, for collective communication is necessary to attribute performance regressions and training failures to the backend network. For this purpose, we introduced two important tools: ROCET, and the PARAM benchmark with the Chakra ecosystem. We built ROCET to associate jobs with RDMA network metrics and provide analysis on top. In addition, we built the PARAM benchmark to analyze and tune collective communication operations through workload traces, and recently scaled it to the community with Chakra for co-designing efficient distributed ML systems. In this talk, we will go over their design and use cases.
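As a flavor of the kind of collective-communication analysis these tools support, here is the standard analytical cost model for a ring all-reduce (a textbook formula, not ROCET or PARAM output; per-hop latency is ignored for brevity):

```python
# Bandwidth term of the standard ring all-reduce cost model:
# over n ranks, each rank transfers 2*(n-1)/n of the message.

def ring_allreduce_seconds(msg_bytes: float, bw_bytes_per_s: float, n: int) -> float:
    """Estimated bandwidth-bound time of a ring all-reduce over n ranks."""
    return 2 * (n - 1) / n * msg_bytes / bw_bytes_per_s

# Hypothetical example: a 1 GiB gradient buffer over 8 ranks,
# each with a 400 Gb/s (= 50 GB/s) network link.
t = ring_allreduce_seconds(2**30, 50e9, 8)
print(f"{t * 1e3:.1f} ms")  # roughly 37.6 ms
```

Comparing a measured collective time against such a model is one simple way to decide whether a regression lives in the workload or in the backend network.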
This presentation will introduce Arcadia, a unified system designed to simulate the compute, memory, and network performance of AI training clusters. By providing a multi-disciplinary performance analysis framework, Arcadia aims to facilitate the design and optimization of various system levels, including application, network, and hardware. This comprehensive system enables researchers and practitioners to gain valuable insights into the performance of future AI models and workloads on specific infrastructures, fostering data-driven decision-making and promoting the future evolution of models and hardware. Arcadia also provides the ability to simulate the performance impact of scheduled operational tasks on AI models running in production, helping engineers make job-aware decisions during day-to-day operational activity. Attendees will learn about the capabilities and potential impact of Arcadia in advancing the field of AI systems and infrastructure.
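To give a sense of what such simulation answers, here is a deliberately tiny performance model in the same spirit (an illustrative sketch with hypothetical numbers, not Arcadia's implementation): per-iteration training time as the interplay of compute and communication, with and without overlap:

```python
# Toy per-iteration training-time model: communication either hides
# behind compute (overlapped) or adds to it (serialized).

def step_time_s(compute_s: float, comm_s: float, overlap: bool) -> float:
    """Iteration time under a simple compute/communication model."""
    return max(compute_s, comm_s) if overlap else compute_s + comm_s

compute, comm = 0.120, 0.045   # hypothetical per-step times in seconds
print(f"serialized: {step_time_s(compute, comm, overlap=False):.3f} s")
print(f"overlapped: {step_time_s(compute, comm, overlap=True):.3f} s")
```

Even this crude model shows why a simulator is useful: the payoff of a faster network depends entirely on whether communication is already hidden behind compute.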
Rajiv is a Software Engineering Director in the Network Infrastructure group at Meta. He... read more
Tanuja Ingale is a Technical Program Manager in the Production Network Infrastructure group at... read more
Jongsoo is a research scientist at Meta, AI Systems Co-design team, optimizing SW for... read more
Petr Lapukhov is a Network Engineer who spent nearly ten years at Meta and,... read more
Hany Morsy is a highly skilled Network Engineer with over 25 years of experience... read more
Susana Contrera is an Infrastructure Network Engineer at Meta. Her team is a key... read more
Shuqiang Zhang is a Software Engineer at Meta. He is currently working on performance... read more
Jingyi Yang is a software engineer on the network.ai team at Meta where she... read more
James Zeng currently leads AI Networking Software team at Meta. Since joining Meta in... read more
At Meta, I lead RDMA Network design and deployments for AI workloads. Before this,... read more
Shengbao is a Research Scientist at Meta. He is part of the AI Networking... read more
Zhaodong Wang is a research scientist and Tech lead at Meta network infra team.... read more
I am a Network Modeling and Optimization Engineer at Meta. Before Meta, I was... read more
Joseph Provine supports NIC and AI Transport teams at Meta. Prior to Meta he... read more