@Scale: Networking

August 13, 2025

Hosted In Person & Virtually
Santa Clara Convention Center

In 2025, @Scale: Networking will continue to focus on the evolution of AI Networking. To address the growing complexity of network operations, we will examine a full-stack perspective on debugging, from the communications layer through to the hardware. By adopting a holistic approach across both our front-end and back-end networks, we can identify and mitigate potential bottlenecks and keep network performance optimal. You will also hear from industry experts and leading researchers who are at the forefront of building large-scale networks. Attendees will have the opportunity to learn about diverse approaches to solving common challenges and to explore potential collaborations.

Joining us are speakers from AMD, Broadcom, ByteDance, Cisco, Google, Meta, Microsoft, NVIDIA, and Oracle Cloud Infrastructure!

Register today and stay tuned for upcoming speaker & agenda announcements.

Agenda at a glance

Event times below are displayed in PT.

August 13
08:30 AM - 09:45 AM
Registration
08:30 AM - 09:45 AM
Breakfast, Raffle Submissions, and Networking
GENERAL SESSION (MISSION CITY BALLROOM)
09:45 AM - 10:05 AM
Event Welcome
10:05 AM - 10:25 AM
Keynote from Meta
Track 1 - Network Technology Evolution
(Mission City Ballroom)
10:45 AM - 11:05 AM
Meta’s DC Networks for Generative AI
11:15 AM - 11:35 AM
RDMA at Cloud Scale: The OCI Experience
11:45 AM - 12:05 PM
Scaling AI Network with DSF
12:05 PM - 12:25 PM
Track 1 - Live Q&A Session #1
12:25 PM - 01:25 PM
Lunch & Networking (Exhibit Hall A)
Track 2 - Post Training and Inference
(2nd Floor Theater)
10:45 AM - 11:05 AM
Inference Deployments and Comms Implication
11:15 AM - 11:35 AM
Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability
11:45 AM - 12:05 PM
PyTorch Symmetric Memory: A New Paradigm for Programming Distributed AI
12:05 PM - 12:25 PM
Track 2 - Live Q&A Session #1
12:25 PM - 01:25 PM
Lunch & Networking (Exhibit Hall A)
Track 1 - AI Networks Scaling
(Mission City Ballroom)
01:35 PM - 01:55 PM
Transparent MultiNIC routing for large AI Models
02:05 PM - 02:25 PM
Architecting Multi-tenant Data-center Networks for GPU Customers
02:25 PM - 02:45 PM
Track 1 - Live Q&A Session #2
Track 2 - ML Systems Scaling
(2nd Floor Theater)
01:35 PM - 01:55 PM
Performance Optimizations at 100K+ Scale
02:05 PM - 02:25 PM
Scaling Llama4 Training to 100K
02:25 PM - 02:45 PM
Track 2 - Live Q&A Session #2
GENERAL SESSION (MISSION CITY BALLROOM)
02:45 PM - 03:15 PM
Networking Break
03:15 PM - 03:35 PM
10x Backbone: Scaling Backbone Connectivity to Serve AI Demands
03:35 PM - 03:55 PM
Keynote from Microsoft
03:55 PM - 04:15 PM
Live Technology Panel
04:15 PM - 04:20 PM
Closing Remarks
04:20 PM - 06:00 PM
Networking Happy Hour

SPEAKERS AND MODERATORS

Vignesh Vijayanath is a Technical Program Management leader at Meta, where he specializes in...

Vignesh Vijayanath

Meta

Gaya Nagarajan joined Meta in 2012 as a Network Engineer and currently serves as the...

Gaya Nagarajan

Meta

Rohit Puri is a Network Software Engineer at Meta, specializing in Network AI....

Rohit Puri

Meta

Hany Morsy is a seasoned Network Engineer with over 20 years of experience in...

Hany Morsy

Meta

Jag Brar is Vice President and Distinguished Engineer at Oracle Cloud Infrastructure (OCI). He...

Jag Brar

Oracle

David Becker is an Architect for GPU Clusters at Oracle Cloud Infrastructure (OCI). He...

David Becker

Oracle

Ron He is a software engineer in the FBOSS (Facebook Open Switch System) team,...

Ron He

Meta

Ankur Singh is a Network AI Engineer at Meta, working on the development of...

Ankur Singh

Meta

Srilakshmi Adusumalli is a Software Engineering Manager at Meta, supporting the FBOSS team. FBOSS...

Srilakshmi Adusumali

Meta

Cen's been working at Meta for about 10 years, spending most of his time...

Cen Zhao

Meta

Xiaodong Wang is a research scientist in the PyTorch AI Acceleration team at Meta. He...

Xiaodong Wang

Meta

Jianyu Huang is a research scientist at Meta, specializing in enhancing the efficiency of...

Jianyu Huang

Meta

Lei Zhang is a research scientist at ByteDance. His research interests are broadly in...

Lei Zhang

ByteDance

Ke Wen is a developer of PyTorch Distributed. His interests include Symmetric Memory, irregular...

Ke Wen

Meta

Natalia Gimelshein has been working on PyTorch for more than 5 years. She made...

Natalia Gimelshein

Meta

James Zeng currently leads the AI Networking Software team at Meta. Since joining Meta in...

James Zeng

Meta

Yingjie is a Software Engineer at Meta. She leads the development of the End...

Yingjie Gu

Meta

Takshak is a Software Engineer at Meta.

Takshak Chahande

Meta

Chang (Changhoon) Kim is a Principal Engineer at Google and works on various ML...

Chang Kim

Google

Weilong Cui is a software engineer at Google. He is working on datacenter networking...

Weilong Cui

Google

Shashi Gandham is a Software Engineering Director at Meta.

Shashi Gandham

Meta

Ashmitha Shetty is a network engineer at Meta. She is a part of the...

Ashmitha Jeevaraj Shetty

Meta

Dr. Min Si is a Research Scientist at Meta. Min contributes to the aspect...

Min Si

Meta

Saif is a Software Engineer at Meta, where he leads the Collective Communication stack....

Saif Hasan

Meta

Omkar is a Software Engineer at Meta.

Omkar Salpekar

Meta

Adi is a Hardware Systems Engineer at Meta.

Adi Gangidi

Meta

Mark McKillop is a Production Engineer in the Backbone Engineering Team. He has been...

Mark McKillop

Meta

Alberto is a Production Engineer at Meta. His focus is on designing, building and...

Alberto Herrero Mediavilla

Meta

Dr. Pradeep Sindhu is an industry visionary currently focused on data processing innovations at...

Pradeep Sindhu

Microsoft

Mohan Kalkunte is Vice President of Architecture & Technology in the Core Switch Products...

Mohan Kalkunte

Broadcom

Rakesh Chopra is a Senior Vice President and Fellow in Cisco's Common Hardware Group,...

Rakesh Chopra

Cisco

Yuval Degani is a Senior Director of Engineering at NVIDIA, leading Hyperscale AI Networking...

Yuval Degani

NVIDIA

Krishna Doddapaneni is currently serving as Corporate Vice President of Software Engineering at...

Krishna Doddapaneni

AMD

Rajiv Krishnamurthy is a Software Engineering Director in the Network Infrastructure group at Meta....

Rajiv Krishnamurthy

Meta

Omar is an Engineering Director at Meta.

Omar Baldonado

Meta

EVENT AGENDA

Event times below are displayed in PT.

August 13

08:30 AM - 09:45 AM
Registration
08:30 AM - 09:45 AM
Breakfast, Raffle Submissions, and Networking
GENERAL SESSION (MISSION CITY BALLROOM)
09:45 AM - 10:05 AM
Event Welcome
Speaker Vignesh Vijayanath,Meta
10:05 AM - 10:25 AM
Keynote from Meta
Speaker Gaya Nagarajan,Meta

Track 1 - Network Technology Evolution

(Mission City Ballroom)

Track 2 - Post Training and Inference

(2nd Floor Theater)

10:45 AM - 11:05 AM
Meta’s DC Networks for Generative AI

This presentation provides context on how Generative AI has created demand for bigger, more scalable, and more performant networks. It offers a historical perspective on our journey and the challenges of scaling to 100K.

Speaker Rohit Puri,Meta
Speaker Hany Morsy,Meta
11:15 AM - 11:35 AM
RDMA at Cloud Scale: The OCI Experience

We will discuss OCI's journey with RDMA in the cloud. We cover key design requirements, the techniques used to meet them, and the challenges encountered. We will end with a look toward the future.

Speaker Jag Brar,Oracle
Speaker David Becker,Oracle
11:45 AM - 12:05 PM
Scaling AI Network with DSF

The Gen-AI boom in 2023 initiated a surge in demand for high-performance, low-latency, lossless AI networks to support large-scale model training. In response, Meta began developing scalable AI networks, with a focus on Distributed Switch Fabric (DSF). DSF's modular architecture is designed to optimize load balancing and congestion control, ensuring high performance for both intra- and inter-cluster traffic. This talk explores the challenges and innovations surrounding DSF and discusses future directions, including the creation of mega clusters through DSF and non-DSF region interconnectivity, as well as the exploration of alternative switching technologies.

Speaker Ron He,Meta
Speaker Ankur Singh,Meta
12:05 PM - 12:25 PM
Track 1 - Live Q&A Session #1
Moderator Srilakshmi Adusumali,Meta
Speaker Rohit Puri,Meta
Speaker Jag Brar,Oracle
Speaker David Becker,Oracle
Speaker Ron He,Meta
Speaker Ankur Singh,Meta
12:25 PM - 01:25 PM
Lunch & Networking (Exhibit Hall A)
10:45 AM - 11:05 AM
Inference Deployments and Comms Implication

This talk addresses the challenges and solutions for scaling large language model (LLM) inference to support up to 1 billion monthly active users across platforms for Meta AI, focusing on the compute-bound prefill and memory-bound decode stages. Key challenges include the quadratic scaling of attention operations with sequence length and the linear growth of the KV cache, along with network-intensive operations that impact latency. To enhance scaling efficiency, a multi-dimensional parallelism strategy is proposed across various hardware platforms, including NVIDIA and AMD. Innovations such as Context Parallelism (CP) and iRoPE enable near-linear prefill scaling, while optimized communication techniques like Dynamic/Persistent All-to-All for Expert Parallelism (EP) and Direct Data Access (DDA) for Tensor Parallelism (TP) significantly improve performance. Future efforts aim to further enhance system efficiency through fused kernels and device-initiated operations.

Speaker Cen Zhao,Meta
Speaker Xiaodong Wang,Meta
Speaker Jianyu Huang,Meta
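
The two scaling trends named in the abstract above (attention cost growing quadratically with sequence length, KV cache growing linearly) can be made concrete with a rough back-of-the-envelope sketch. The formulas are the standard textbook estimates, not figures from the talk, and every model dimension used below (layer count, head count, head size, 2-byte elements) is a hypothetical value chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    Each layer stores one key and one value vector per token per KV head,
    so the size grows linearly with seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def attention_matmul_flops(seq_len, n_layers, n_heads, head_dim):
    """Approximate prefill FLOPs for the two attention matmuls (QK^T and scores @ V).

    Each matmul costs about 2 * seq_len * seq_len * head_dim FLOPs per head,
    so the total grows quadratically with seq_len. Causal masking is ignored.
    """
    return n_layers * n_heads * 2 * (2 * seq_len * seq_len * head_dim)


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for seq_len in (8_192, 16_384, 32_768):
        gib = kv_cache_bytes(seq_len, **shape) / 2**30
        flops = attention_matmul_flops(seq_len, n_layers=32, n_heads=32, head_dim=128)
        print(f"seq_len={seq_len}: KV cache ~{gib:.1f} GiB, attention ~{flops:.2e} FLOPs")
```

Doubling the sequence length doubles the cache but quadruples the attention matmul cost, which is why the prefill and decode stages are usually treated as separate scaling problems.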
11:15 AM - 11:35 AM
Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability

As large language model (LLM) training scales across tens of thousands of GPUs, ensuring runtime reliability becomes both more challenging and more critical for maintaining efficiency. This talk explores how fine-grained observability can substantially enhance reliability in LLM training at scale. First, we discuss automated methods for detecting faulty machines by leveraging distinctive monitoring metric patterns, enabling rapid and accurate identification of problematic nodes while minimizing manual intervention. Second, we tackle reliability challenges within collective communication libraries (CCL), introducing a lightweight tracing and root cause analysis system that treats CCL as system software and reveals internal control and data dependencies. This approach allows for swift and precise detection of communication-related anomalies. Collectively, these advancements illustrate how fine-grained observability at both the machine and communication levels can significantly improve the robustness and operational efficiency of large-scale LLM training.

Speaker Lei Zhang,ByteDance
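
As a rough illustration of the first idea in the abstract above (spotting a faulty machine from its monitoring metrics), the sketch below flags hosts whose metric deviates strongly from the fleet median. This is a generic robust-outlier technique (a modified z-score based on the median absolute deviation), not the method presented in the talk; the function name, the threshold of 3.5, and the sample throughput numbers are all assumptions.

```python
from statistics import median


def flag_outlier_hosts(metric_by_host, threshold=3.5):
    """Return hosts whose metric is a robust outlier relative to the fleet.

    Uses the modified z-score 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. Hypothetical helper for illustration only.
    """
    values = list(metric_by_host.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All hosts (or a majority) report the same value: nothing to rank against.
        return []
    return sorted(
        host
        for host, value in metric_by_host.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )


if __name__ == "__main__":
    # Made-up per-host throughput samples; host "e" is far below its peers.
    throughput = {"a": 100.0, "b": 101.0, "c": 99.0, "d": 100.0, "e": 40.0}
    print(flag_outlier_hosts(throughput))
```

A median-based score is used here instead of mean and standard deviation because a single badly degraded host would otherwise inflate the spread and hide itself.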
11:45 AM - 12:05 PM
PyTorch Symmetric Memory: A New Paradigm for Programming Distributed AI

Recent model advancements have highlighted the need for customized communication. In response, PyTorch introduces Symmetric Memory, a distributed programming model that creates a global address space for data spanning multiple GPUs' memory. In this talk, we will demonstrate how developers can author their own communication kernels at the device level. Additionally, we will show how to interleave communication and computation within the same kernel using popular languages like Triton, achieving the finest-grained fusion possible. We will also discuss key network technologies for scaling symmetric memory across nodes.

Speaker Ke Wen,Meta
Speaker Natalia Gimelshein,Meta
12:05 PM - 12:25 PM
Track 2 - Live Q&A Session #1
Moderator James Zeng,Meta
Speaker Cen Zhao,Meta
Speaker Xiaodong Wang,Meta
Speaker Jianyu Huang,Meta
Speaker Ke Wen,Meta
Speaker Natalia Gimelshein,Meta
Speaker Lei Zhang,ByteDance
12:25 PM - 01:25 PM
Lunch & Networking (Exhibit Hall A)

Track 1 - AI Networks Scaling

(Mission City Ballroom)

Track 2 - ML Systems Scaling

(2nd Floor Theater)

01:35 PM - 01:55 PM
Transparent MultiNIC routing for large AI Models

Training large-scale AI models necessitates the transfer of terabits of data per second for various needs, e.g., checkpointing, data ingestion, and hot sparing. However, current network configurations and a lack of application awareness of the underlying hardware resources often result in suboptimal resource utilization, leading to delayed checkpoint flushes, increased GPU idle time, and higher failover latency.

We present a transparent multi-NIC routing solution that eliminates these bottlenecks for both egress and ingress traffic, improving NIC utilization for large-scale AI models.

Speaker Yingjie Gu,Meta
Speaker Takshak Chahande,Meta
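
To give a feel for why spreading traffic across several NICs improves utilization, here is a minimal sketch of a greedy least-loaded assignment of flows to NICs. It is only an illustration of the general load-balancing idea, not the routing solution described in the talk; the function name, the flow names, the NIC names, and the sizes are all hypothetical.

```python
import heapq


def assign_flows_to_nics(flow_sizes, nic_names):
    """Greedily place each flow on the currently least-loaded NIC.

    flow_sizes maps a flow id to its size in arbitrary units. Flows are
    placed largest first, which tends to give a more even split.
    """
    heap = [(0, name) for name in nic_names]  # (current load, NIC name)
    heapq.heapify(heap)
    assignment = {name: [] for name in nic_names}
    for flow_id, size in sorted(flow_sizes.items(), key=lambda kv: -kv[1]):
        load, name = heapq.heappop(heap)
        assignment[name].append(flow_id)
        heapq.heappush(heap, (load + size, name))
    return assignment


if __name__ == "__main__":
    # Hypothetical flows named after the traffic types listed in the abstract.
    flows = {"checkpoint": 8, "ingestion": 5, "hot_spare": 3}
    print(assign_flows_to_nics(flows, ["nic0", "nic1"]))
```

With a single NIC, all three flows would share one link; with two, the largest flow gets a link to itself while the two smaller ones share the other.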
02:05 PM - 02:25 PM
Architecting Multi-tenant Data-center Networks for GPU Customers

Generative AI is revolutionizing cloud data centers, pushing the limits of what is possible in computing. While the industry already knows how to virtualize regular data-center networks, virtualizing GPU networks in a cloud introduces new challenges. In our talk, we will share Google’s architecture, how we create cutting-edge cloud data centers tailored for GenAI workloads, and our experience with our choice of GPU NIC and its SDK to ensure exceptional performance, scalability, efficiency, security, operability, and seamless integration with existing systems.

Speaker Chang Kim,Google
Speaker Weilong Cui,Google
02:25 PM - 02:45 PM
Track 1 - Live Q&A Session #2
Moderator Shashi Gandham,Meta
Speaker Chang Kim,Google
Speaker Weilong Cui,Google
Speaker Yingjie Gu,Meta
Speaker Takshak Chahande,Meta
01:35 PM - 01:55 PM
Performance Optimizations at 100K+ Scale

Presentation information coming soon!

Speaker Ashmitha Jeevaraj Shetty,Meta
Speaker Min Si,Meta
02:05 PM - 02:25 PM
Scaling Llama4 Training to 100K

Llama 4's pre-training scale is growing exponentially, with 100K GPUs used, a 6x increase from its predecessor. At this scale, initializing training takes longer and the probability of failure increases. Training throughput, also known as Effective Training Time, degrades significantly as a result. To address these challenges, researchers are experimenting in parallel with faster initialization of large-scale jobs and with fault-tolerant paradigms.

Speaker Saif Hasan,Meta
Speaker Omkar Salpekar,Meta
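
The claim in the abstract above that failure probability rises with scale follows from simple arithmetic, sketched below under a strong simplifying assumption: every GPU fails independently with the same probability. The per-GPU failure probability and MTBF used here are made-up numbers, not measurements from Llama 4 training.

```python
def job_failure_probability(n_gpus, per_gpu_failure_prob):
    """Probability that at least one of n_gpus fails in a given interval.

    Assumes independent, identically distributed failures (a simplification).
    """
    return 1.0 - (1.0 - per_gpu_failure_prob) ** n_gpus


def cluster_mtbf_hours(per_gpu_mtbf_hours, n_gpus):
    """Mean time between failures for the whole job under the same assumption."""
    return per_gpu_mtbf_hours / n_gpus


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-GPU failure probability per interval
    for n in (1_000, 10_000, 100_000):
        print(f"{n} GPUs: P(at least one failure) = {job_failure_probability(n, p):.4f}")
    # Hypothetical per-GPU MTBF of 50,000 hours.
    print(f"cluster MTBF at 100K GPUs: {cluster_mtbf_hours(50_000.0, 100_000)} hours")
```

Because a synchronous training job typically stalls when any single participant fails, the interval between interruptions shrinks in proportion to the GPU count, which is what motivates the faster initialization and fault-tolerant approaches mentioned in the abstract.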
02:25 PM - 02:45 PM
Track 2 - Live Q&A Session #2
Moderator Adi Gangidi,Meta
Speaker Ashmitha Jeevaraj Shetty,Meta
Speaker Min Si,Meta
Speaker Saif Hasan,Meta
Speaker Omkar Salpekar,Meta
GENERAL SESSION (MISSION CITY BALLROOM)
02:45 PM - 03:15 PM
Networking Break
03:15 PM - 03:35 PM
10x Backbone: Scaling Backbone Connectivity to Serve AI Demands

In this presentation, we will share details about Meta's Backbone Network, its recent developments, and the journey to support the increasing demands that our existing and new AI workloads place on the network. We will discuss new technologies and designs that address 10x scaling needs, as well as how some of these same principles are being applied to the emerging requirement of extending AI clusters across the 10km boundary, between multiple DCs.

Speaker Mark McKillop,Meta
Speaker Alberto Herrero Mediavilla,Meta
03:35 PM - 03:55 PM
Keynote from Microsoft
Speaker Pradeep Sindhu,Microsoft
03:55 PM - 04:15 PM
Live Technology Panel
Panelist Mohan Kalkunte,Broadcom
Panelist Rakesh Chopra,Cisco
Panelist Yuval Degani,NVIDIA
Panelist Krishna Doddapaneni,AMD
Panelist Rajiv Krishnamurthy,Meta
Moderator Omar Baldonado,Meta
04:15 PM - 04:20 PM
Closing Remarks
Speaker Omar Baldonado,Meta
04:20 PM - 06:00 PM
Networking Happy Hour

2025 Events

@Scale is a technical conference series for engineers who build or maintain systems designed for scale. New this year, in-person and virtual attendance options will be available at all four of our programs, which bring together complementary themes to create event communities and spark cross-discipline collaboration.

PRODUCT - OCTOBER 22, 2025

Hosted In Person & Virtually
Meta Campus, Menlo Park

@Scale: Product is an exciting evolution of the conference series, bringing together the best of Product @Scale, RTC @Scale, Mobile @Scale, and Video @Scale. This comprehensive program is designed for engineers who are passionate about building and optimizing large-scale products. Attendees will gain insights into the latest innovations, best practices, and tools that drive efficiency and performance across product development, real-time communication, mobile platforms, and video technologies.

Register today and stay tuned for upcoming agenda announcements.

SYSTEMS & RELIABILITY - PAST EVENT

Hosted In Person & Virtually
Meta Campus, Menlo Park

The first installment of the 2025 @Scale conference series will combine two of the most foundational topics across the stack, Systems & Reliability. This two-track program will feature technical talks about the demands of AI and the conference theme of "rising to the challenge." The themed talks will include compelling stories about solving the hardest hyper-scale problems with distributed systems, infra resilience and many more complex challenges by speakers from around the industry.

AI & DATA - PAST EVENT

Meta’s Engineering and Infrastructure teams are excited to bring together a global contingent of engineers who are interested in building, operating, and using AI and data systems at scale.

This year, we will focus on building a world in which Agents interact with billions of users, a critical step towards unlocking the full potential of AI and data systems. Our in-person talks and panels will delve into the latest advancements in agent development, deployment, and product integration, featuring expert insights on topics such as data for agents, agent tools & environments, safety, and privacy. Attendees can expect to gain practical knowledge and strategies for building AI-powered products, as well as a deeper understanding of the evolving ecosystem and its implications for traditional BI and product analytics.

Register today and learn how you can win a pair of Ray-Ban Meta Wayfarers!

NETWORKING - PAST EVENT

Hosted In Person & Virtually
Santa Clara Convention Center

In 2025, @Scale: Networking will continue to focus on the evolution of AI Networking. To address the growing complexity of network operations, we will examine a full-stack perspective on debugging, from the communications layer through to the hardware. By adopting a holistic approach across both our front-end and back-end networks, we can identify and mitigate potential bottlenecks and keep network performance optimal. You will also hear from industry experts and leading researchers who are at the forefront of building large-scale networks. Attendees will have the opportunity to learn about diverse approaches to solving common challenges and to explore potential collaborations.

Joining us are speakers from AMD, Broadcom, ByteDance, Cisco, Google, Meta, Microsoft, NVIDIA, and Oracle Cloud Infrastructure!

Register today and stay tuned for upcoming speaker & agenda announcements.
