@Scale: Networking

August 13, 2025

Hosted In Person & Virtually
Santa Clara Convention Center

In 2025, @Scale: Networking will continue to focus on the evolution of AI Networking. To address the growing complexity of network operations, we will examine a full-stack perspective on debugging, from the communications layer through to the hardware. By adopting a holistic approach across both our front-end and back-end networks, we can identify and mitigate potential bottlenecks, ensuring optimal network performance. You will also hear from industry experts and leading researchers who are at the forefront of building large-scale networks. Attendees will have the opportunity to learn about diverse approaches to solving common challenges and to explore potential collaborations.

Joining us are speakers from AMD, Broadcom, ByteDance, Cisco, Google, Meta, Microsoft, NVIDIA, and Oracle Cloud Infrastructure!

RSVPS CLOSED

EVENT AGENDA

Event times below are displayed in PT.

August 13

08:30 AM - 09:45 AM

Registration

08:30 AM - 09:45 AM

Breakfast, Raffle Submissions, and Networking

GENERAL SESSION (MISSION CITY BALLROOM)

09:45 AM - 10:05 AM

Event Welcome

Speaker Vignesh Vijayanath,Meta

10:05 AM - 10:25 AM

Keynote from Meta

Speaker Gaya Nagarajan,Meta

Track 1 - Network Technology Evolution

(Mission City Ballroom)

Track 2 - Post Training and Inference

(2nd Floor Theater)

10:45 AM - 11:05 AM

Meta’s DC Networks for Generative AI

This presentation provides context on how Generative AI has created demand for a bigger, more scalable, and more performant network. It offers a historical perspective on our journey and the challenges of scaling to 100K.

Speaker Rohit Puri,Meta

Speaker Hany Morsy,Meta

11:15 AM - 11:35 AM

RDMA at Cloud Scale: The OCI Experience

We will discuss OCI's journey with RDMA in the cloud, covering the key design requirements, the techniques used to meet them, and the challenges encountered. We will end with a look towards the future.

Speaker Jag Brar,Oracle

Speaker David Becker,Oracle

11:45 AM - 12:05 PM

Scaling AI Network with DSF

The Gen-AI boom of 2023 initiated a surge in demand for high-performance, low-latency, lossless AI networks to support large-scale model training. In response, Meta set out to develop scalable AI networks, with a focus on Distributed Switch Fabric (DSF). DSF's modular architecture is designed to optimize load balancing and congestion control, ensuring high performance for both intra- and inter-cluster traffic. This talk explores the challenges and innovations surrounding DSF and discusses future directions, including the creation of mega clusters through DSF and non-DSF region interconnectivity, as well as the exploration of alternative switching technologies.

Speaker Ron He,Meta

Speaker Ankur Singh,Meta

12:05 PM - 12:25 PM

Track 1 - Live Q&A Session #1

Moderator Srilakshmi Adusumali,Meta

Speaker Rohit Puri,Meta

Speaker Jag Brar,Oracle

Speaker David Becker,Oracle

Speaker Ron He,Meta

Speaker Ankur Singh,Meta

12:25 PM - 01:25 PM

Lunch & Networking (Exhibit Hall A)

10:45 AM - 11:05 AM

Inference Deployments and Comms Implication

This talk addresses the challenges and solutions for scaling large language model (LLM) inference to support up to 1 billion monthly active users across platforms for Meta AI, focusing on compute-bound prefill and memory-bound decode stages. Key challenges include the quadratic scaling of attention operations with sequence length and the linear growth of the KV cache, along with network-intensive operations impacting latency. To enhance scaling efficiency, a multi-dimensional parallelism strategy is proposed across various hardware platforms, including NVIDIA and AMD. Innovations such as Context Parallelism (CP) and iRoPE enable near-linear prefill scaling, while optimized communication techniques like Dynamic/Persistent All-to-All for Expert Parallelism (EP) and Direct Data Access (DDA) for Tensor Parallelism (TP) significantly improve performance. Future efforts aim to further enhance system efficiency through fused kernels and device-initiated operations.

Speaker Cen Zhao,Meta

Speaker Xiaodong Wang,Meta

Speaker Jianyu Huang,Meta

11:15 AM - 11:35 AM

Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability

As large language model (LLM) training scales across tens of thousands of GPUs, ensuring runtime reliability becomes both more challenging and more critical for maintaining efficiency. This talk explores how fine-grained observability can substantially enhance reliability in LLM training at scale. First, we discuss automated methods for detecting faulty machines by leveraging distinctive monitoring metric patterns, enabling rapid and accurate identification of problematic nodes while minimizing manual intervention. Second, we tackle reliability challenges within collective communication libraries (CCL), introducing a lightweight tracing and root cause analysis system that treats CCL as system software and reveals internal control and data dependencies. This approach allows for swift and precise detection of communication-related anomalies. Collectively, these advancements illustrate how fine-grained observability at both the machine and communication levels can significantly improve the robustness and operational efficiency of large-scale LLM training.

Speaker Lei Zhang,ByteDance

11:45 AM - 12:05 PM

PyTorch Symmetric Memory: A New Paradigm for Programming Distributed AI

Recent model advancements have highlighted the need for customized communication. In response, PyTorch introduces Symmetric Memory, a distributed programming model that creates a global address space for data spanning multiple GPUs' memory. In this talk, we will demonstrate how developers can author their own communication kernels at the device level. Additionally, we will show how to interleave communication and computation within the same kernel using popular languages like Triton, achieving the finest-grained fusion possible. We will also discuss key network technologies for scaling symmetric memory across nodes.

Speaker Ke Wen,Meta

Speaker Natalia Gimelshein,Meta

12:05 PM - 12:25 PM

Track 2 - Live Q&A Session #1

Moderator James Zeng,Meta

Speaker Cen Zhao,Meta

Speaker Xiaodong Wang,Meta

Speaker Jianyu Huang,Meta

Speaker Ke Wen,Meta

Speaker Natalia Gimelshein,Meta

Speaker Lei Zhang,ByteDance

12:25 PM - 01:25 PM

Lunch & Networking (Exhibit Hall A)

Track 1 - AI Networks Scaling

(Mission City Ballroom)

Track 2 - ML Systems Scaling

(2nd Floor Theater)

01:35 PM - 01:55 PM

Transparent MultiNIC routing for large AI Models

Training large-scale AI models necessitates the transfer of terabits of data per second for needs such as checkpointing, data ingestion, and hot sparing. However, current network configurations and a lack of application awareness of the underlying hardware resources often result in suboptimal resource utilization, leading to delayed checkpoint flushes and to increased GPU idle time and failover latency.

We present a transparent multi-NIC routing solution that eliminates these bottlenecks for both egress and ingress traffic, improving NIC utilization for large-scale AI models.

Speaker Yingjie Gu,Meta

Speaker Takshak Chahande,Meta

Featured Article

Transparent Multi-NIC Routing for Large AI Models

02:05 PM - 02:25 PM

Architecting Multi-tenant Data-center Networks for GPU Customers

Generative AI is revolutionizing cloud data centers, pushing the limits of what is possible in computing. While the industry already knows how to virtualize regular data-center networks, virtualizing GPU networks in a cloud introduces new challenges. In our talk we will share Google’s architecture, how we create cutting-edge cloud data centers tailored for GenAI workloads, and our experience with our choice of GPU NIC and its SDK to ensure exceptional performance, scalability, efficiency, security, operability, and seamless integration with existing systems.

Speaker Chang Kim,Google

Speaker Weilong Cui,Google

02:25 PM - 02:45 PM

Track 1 - Live Q&A Session #2

Moderator Shashi Gandham,Meta

Speaker Chang Kim,Google

Speaker Weilong Cui,Google

Speaker Yingjie Gu,Meta

Speaker Takshak Chahande,Meta

01:35 PM - 01:55 PM

Performance Optimizations at 100K+ Scale

Presentation information coming soon!

Speaker Ashmitha Jeevaraj Shetty,Meta

Speaker Min Si,Meta

02:05 PM - 02:25 PM

Scaling Llama4 Training to 100K

Llama 4's pre-training scale is growing exponentially, with 100K GPUs used, a 6x increase from its predecessor. At this scale, initializing training takes longer and the probability of failure increases. As a result, training throughput (also known as Effective Training Time) degrades significantly. To address these challenges, researchers are experimenting in parallel with faster initialization of large-scale jobs and with fault-tolerant paradigms.

Speaker Saif Hasan,Meta

Speaker Omkar Salpekar,Meta

02:25 PM - 02:45 PM

Track 2 - Live Q&A Session #2

Moderator Adi Gangidi,Meta

Speaker Ashmitha Jeevaraj Shetty,Meta

Speaker Min Si,Meta

Speaker Saif Hasan,Meta

Speaker Omkar Salpekar,Meta

GENERAL SESSION (MISSION CITY BALLROOM)

02:45 PM - 03:15 PM

Networking Break

03:15 PM - 03:35 PM

10x Backbone: Scaling Backbone Connectivity to Serve AI Demands

In this presentation we will share details about Meta's Backbone Network, its recent developments, and the journey to support the increasing demands that our existing and new AI workloads place on the network. We discuss new technologies and designs that address 10x scaling needs, as well as how some of these same principles are being applied to the emerging requirement of extending AI clusters across the 10km boundary, between multiple DCs.

Speaker Mark McKillop,Meta

Speaker Alberto Herrero Mediavilla,Meta

03:35 PM - 03:55 PM

Keynote from Microsoft

Speaker Pradeep Sindhu,Microsoft

03:55 PM - 04:15 PM

Live Technology Panel

Panelist Mohan Kalkunte,Broadcom

Panelist Rakesh Chopra,Cisco

Panelist Yuval Degani,NVIDIA

Panelist Krishna Doddapaneni,AMD

Panelist Rajiv Krishnamurthy,Meta

Moderator Omar Baldonado,Meta

04:15 PM - 04:20 PM

Closing Remarks

Speaker Omar Baldonado,Meta

04:20 PM - 06:00 PM

Networking Happy Hour

SPEAKERS AND MODERATORS

Vignesh Vijayanath is a Technical Program Management leader at Meta, where he specializes in...

Vignesh Vijayanath

Meta

Gaya Nagarajan joined Meta in 2012 as a Network Engineer and currently serves as the...

Gaya Nagarajan

Meta

Rohit Puri is a Network Software Engineer at Meta, specializing in Network AI....

Rohit Puri

Meta

Hany Morsy is a seasoned Network Engineer with over 20 years of experience in...

Hany Morsy

Meta

Jag Brar is Vice President and Distinguished Engineer at Oracle Cloud Infrastructure (OCI). He...

Jag Brar

Oracle

David Becker is an Architect for GPU Clusters at Oracle Cloud Infrastructure (OCI). He...

David Becker

Oracle

Ron He is a software engineer in the FBOSS (Facebook Open Switch System) team,...

Ron He

Meta

Ankur Singh is a Network AI Engineer at Meta, working on the development of...

Ankur Singh

Meta

Srilakshmi Adusumalli is a Software Engineering Manager at Meta, supporting the FBOSS team. FBOSS...

Srilakshmi Adusumali

Meta

Cen's been working at Meta for about 10 years, spending most of his time...

Cen Zhao

Meta

Xiaodong Wang is a research scientist in the PyTorch AI Acceleration team at Meta. He...

Xiaodong Wang

Meta

Jianyu Huang is a research scientist at Meta, specializing in enhancing the efficiency of...

Jianyu Huang

Meta

Lei Zhang is a research scientist at ByteDance. His research interests are broadly in...

Lei Zhang

ByteDance

Ke Wen is a developer of PyTorch Distributed. His interests include Symmetric Memory, irregular...

Ke Wen

Meta

Natalia Gimelshein has been working on PyTorch for more than 5 years. She made...

Natalia Gimelshein

Meta

James Zeng currently leads the AI Networking Software team at Meta. Since joining Meta in...

James Zeng

Meta

Yingjie is a Software Engineer at Meta. She leads the development of the End...

Yingjie Gu

Meta

Takshak is a Software Engineer at Meta.

Takshak Chahande

Meta

Chang (Changhoon) Kim is a Principal Engineer at Google and works on various ML...

Chang Kim

Google

Weilong Cui is a software engineer at Google. He is working on datacenter networking...

Weilong Cui

Google

Software Engineering Director at Meta.

Shashi Gandham

Ashmitha Jeevaraj Shetty

Meta

Dr. Min Si is a Research Scientist at Meta. Min contributes to the aspect...

Min Si

Meta

Saif is a Software Engineer at Meta, where he leads the Collective Communication stack....

Saif Hasan

Meta

Omkar is a Software Engineer at Meta.

Omkar Salpekar

Meta

Adi is a Hardware Systems Engineer at Meta.

Adi Gangidi

Mark McKillop

Meta

Alberto is a Production Engineer at Meta. His focus is on designing, building and...

Alberto Herrero Mediavilla

Meta

Dr. Pradeep Sindhu is an industry visionary currently focused on data processing innovations at...

Pradeep Sindhu

Microsoft

Mohan Kalkunte is Vice President of Architecture & Technology in the Core Switch Products...

Mohan Kalkunte

Broadcom

Rakesh Chopra is a Senior Vice President and Fellow in Cisco's Common Hardware Group,...

Rakesh Chopra

Cisco

Yuval Degani is a Senior Director of Engineering at NVIDIA, leading Hyperscale AI Networking...

Yuval Degani

NVIDIA

Krishna Doddapaneni is currently serving as Corporate Vice President of Software Engineering at...

Krishna Doddapaneni

AMD

Rajiv Krishnamurthy is a Software Engineering Director in the Network Infrastructure group at Meta....

Rajiv Krishnamurthy

Meta

Omar is an Engineering Director at Meta.

Omar Baldonado

Meta

2026 Events

The @Scale 2026 event series is on the way, with official dates, locations, and full event details coming soon!

@Scale returns with four live, in-person shows focused on a series of topics and engineers who build or maintain systems designed for scale: Systems & Reliability, AI & Data, Networking, and Product. Each event will also be available via live stream, making it easy to join from anywhere!

Want to be one of the first to know when dates are released and registration opens? Sign up for our newsletter below to receive announcements and early updates as soon as they’re available. Until then, you can explore highlights from last year on the @Scale YouTube channel at youtube.com/@scaleconference.

PRODUCT - PAST EVENT

Hosted In Person & Virtually
Meta Campus, Menlo Park

@Scale: Product is an exciting evolution of the conference series, bringing together the best of Product @Scale, RTC @Scale, Mobile @Scale, and Video @Scale. This comprehensive program is designed for engineers who are passionate about building and optimizing large-scale products. Attendees will gain insights into the latest innovations, best practices, and tools that drive efficiency and performance across product development, real-time communication, mobile platforms, and video technologies.

SYSTEMS & RELIABILITY - PAST EVENT

Hosted In Person & Virtually
Meta Campus, Menlo Park

The first installment of the 2025 @Scale conference series will combine two of the most foundational topics across the stack, Systems & Reliability. This two-track program will feature technical talks about the demands of AI and the conference theme of "rising to the challenge." The themed talks will include compelling stories about solving the hardest hyper-scale problems with distributed systems, infra resilience and many more complex challenges by speakers from around the industry.

AI & DATA - PAST EVENT

Meta’s Engineering and Infrastructure teams are excited to bring together a global contingent of engineers who are interested in building, operating, and using AI and data systems at scale.

This year, we will focus on building a world in which Agents interact with billions of users, a critical step towards unlocking the full potential of AI and data systems. Our in-person talks and panels will delve into the latest advancements in agent development, deployment, and product integration, featuring expert insights on topics such as data for agents, agent tools & environments, safety, and privacy. Attendees can expect to gain practical knowledge and strategies for building AI-powered products, as well as a deeper understanding of the evolving ecosystem and its implications for traditional BI and product analytics.

LATEST NOTES

Networking @Scale

12/17/2025

Transparent Multi-NIC Routing for Large AI Models

Additional Author: Raman Sukhau

AI training is scaling faster than ever, and with it, the demands on data center network...