@Scale: Systems & Reliability

MAY 6, 2025

Hosted In Person & Virtually
Meta Campus, Menlo Park

The first installment of the 2025 @Scale conference series will combine two of the most foundational topics across the stack: Systems & Reliability. This two-track program will feature technical talks about the demands of AI, organized around the conference theme of "rising to the challenge."

This year's event will bring together technology experts from AMD, Clockwork Systems, Google, Meta, Microsoft, Netflix, NVIDIA, Pinterest, Pure Storage, and Union.ai to share compelling stories about solving the hardest hyperscale problems in distributed systems, infrastructure resilience, and many other complex challenges.


EVENT AGENDA

Event times below are displayed in PT.

May 6

08:30 AM - 09:45 AM
Registration
08:30 AM - 09:45 AM
Breakfast, Raffle Submissions, and Networking
09:45 AM - 10:00 AM
Event Welcome
10:00 AM - 10:05 AM
Opening Remarks
10:05 AM - 10:30 AM
Keynote
Speaker Peter Hoose, Meta
Speaker Surupa Biswas, Meta
10:30 AM - 11:05 AM
Fireside Chat
Speaker Jay Parikh, Microsoft

Track 1

11:05 AM - 11:25 AM
The Need for Speed: The Story of How We Achieved Industry Leading TTP

This presentation shares the high-stakes story of Meta's transformation from lagging behind the industry in GPU platform time-to-production (TTP) to becoming an industry leader. We'll begin by explaining the significance of TTP, providing a brief history of our New Product Introductions (NPIs), including AI platforms. A crash course on NPIs will follow, setting the stage for a deeper dive into the key lessons learned from these experiences. Next, we'll walk you through the major changes that drove our dramatic improvement in TTP and conclude with an overview of our current challenges and future work.

Speaker Richard Wareing, Meta
Speaker Tyler Graf, Meta
11:25 AM - 11:45 AM
From Silicon to Scale: Training at Scale with AMD's Software Stack

As large-scale AI models continue to redefine the frontier of computing, the infrastructure and software required to train them must evolve in lockstep. At AMD, we are building high-performance, scalable training platforms that power the next generation of machine learning applications.

Speaker Zhenyu Gu, AMD
11:45 AM - 12:45 PM
Lunch
12:45 PM - 01:10 PM
Gang Scheduling for Llama

The rapid advancement of AI has necessitated a fundamental shift in infrastructure, moving from homogeneous workloads that fit within a single server to multi-host workloads requiring tight container coordination across multiple servers. This talk explores the motivations and design principles behind this shift, focusing on the implementation of first-class support for gang scheduling at all layers of the system. We delve into the key components of this design, including Twine and the Resource Allowance System (RAS), and examine how they enable AI serving schemes that employ various forms of parallelism (such as pipeline, context, tensor, and expert parallelism) and require container shared-fate properties and network-topology-aware allocation. By addressing these challenges, we aim to provide insights into building scalable and reliable systems that meet the demands of modern AI workloads.
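
For readers unfamiliar with gang scheduling, the short Python sketch below illustrates the core idea: a multi-host job's containers are placed all-or-nothing within a single topology scope and share fate. The spec fields, function names, and numbers are hypothetical illustrations, not the Twine or RAS APIs.

from dataclasses import dataclass

@dataclass
class GangSpec:
    job_name: str
    num_containers: int     # e.g., one per pipeline/tensor/expert-parallel rank
    gpus_per_container: int
    shared_fate: bool       # if any container fails, the whole gang is rescheduled
    topology_scope: str     # e.g., "rack" or "backend-network zone"

def schedule_gang(spec: GangSpec, free_hosts_by_scope: dict) -> str:
    # All-or-nothing placement: find a single topology scope with enough free
    # hosts for every container; otherwise queue the whole gang rather than
    # starting a partial (and therefore unusable) multi-host job.
    for scope, free_hosts in free_hosts_by_scope.items():
        if free_hosts >= spec.num_containers:
            return f"placed all {spec.num_containers} containers of {spec.job_name} in {scope}"
    return f"queued {spec.job_name}: no single scope can hold the full gang"

spec = GangSpec("llama-serving", num_containers=16, gpus_per_container=8,
                shared_fate=True, topology_scope="backend-network zone")
print(schedule_gang(spec, {"zone-a": 12, "zone-b": 24}))   # lands in zone-b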

Speaker Anca Agape, Meta
Speaker Andrei Darabanov, Meta
01:10 PM - 01:40 PM
Live Q&A Session #1
01:40 PM - 01:50 PM
Challenges with Ultra-low Latency LLM Inference at Scale

In this talk, we will discuss the challenges of running ultra-low latency Large Language Model (LLM) inference at scale. We will cover the unique challenges of LLM inference, such as large model sizes and KV caching. We will also discuss the challenges of scaling LLM inference to handle large volumes of requests, including hardware requirements, efficient scale-up, and new routing architectures. Finally, we will present some of our recent work addressing these challenges, including our development of inference infrastructure at Union.
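
As a primer on the KV-caching challenge mentioned above, this generic Python toy (not Union's inference stack) shows why decoders cache attention keys and values: each new token attends over the cached prefix instead of recomputing it, trading memory for latency. The cache also makes the memory problem concrete, since it grows with sequence length and batch size.

import numpy as np

class KVCache:
    # Grows by one key/value row per decoded token, so per-token work stays
    # proportional to the prefix length instead of recomputing from scratch.
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(q, cache):
    # Single-head scaled dot-product attention over the cached prefix.
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values

cache = KVCache(dim=64)
for _ in range(4):                          # stand-ins for per-token model outputs
    q = k = v = np.random.randn(64)
    cache.append(k, v)
    out = attend(q, cache)
print(out.shape, cache.keys.shape)          # (64,) (4, 64)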

Speaker Haytham Abuelfutuh, Union.ai
01:50 PM - 02:05 PM
Advancing Flash Storage @ Meta

The growth of data and need for increased power efficiency are leading to innovative storage solutions. HDDs have been growing in density, but not performance, and TLC flash remains at a price point that is restrictive for scaling. QLC technology addresses these challenges by forming a middle tier between HDDs and TLC SSDs. At Meta we are deploying high density QLC racks at scale, driving innovation to overcome interesting software and hardware challenges, and promoting ecosystem alignment in this evolving storage space.

Speaker Sumit Gupta, Meta
Speaker Riley Thomasson, Pure Storage
02:05 PM - 02:35 PM
Break
02:35 PM - 02:50 PM
A Planet-Scale Computer – Abstract Away Regions via Global Service Placer (GSP)

Both public clouds and our hyperscale private cloud have evolved into complex infrastructures with millions of servers spanning numerous global data center regions. Leaving users to manage the complexity of deploying global services across regions incurs significant operational toil and often leads to suboptimal outcomes. Users must select regions, align global traffic routing with service deployments, and ensure disaster preparedness, all while optimizing cost. They must continually repeat these processes to adapt to workload changes. To eliminate these manual burdens, we introduce the Global Service Placer (GSP), which autonomously places services across regions based on user-defined latency SLOs, abstracting away details of regions from users, such as their geo-distribution and resource availability, while optimizing efficiency. The generality and efficacy of GSP have been demonstrated by onboarding a diverse set of large, complex services. We provide a case study of a highly complex AI inference workload and show significant GPU savings.
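
To make the placement problem concrete, here is a minimal Python sketch of SLO-driven region selection: pick the cheapest set of regions that keeps every user cluster within its latency SLO, so service owners never have to name regions themselves. The regions, costs, and latencies are invented for illustration; this is not GSP's actual algorithm.

from itertools import combinations

REGION_COST = {"us-east": 1.0, "us-west": 1.1, "eu": 1.3}   # hypothetical relative cost
LATENCY_MS = {                                              # user cluster -> region RTT
    "na-users": {"us-east": 20, "us-west": 35, "eu": 90},
    "eu-users": {"us-east": 85, "us-west": 120, "eu": 15},
}

def place(slo_ms: float):
    # Smallest, then cheapest, region set that meets the SLO for all user clusters.
    regions = list(REGION_COST)
    for size in range(1, len(regions) + 1):
        candidates = [
            combo for combo in combinations(regions, size)
            if all(min(LATENCY_MS[u][r] for r in combo) <= slo_ms for u in LATENCY_MS)
        ]
        if candidates:
            return min(candidates, key=lambda combo: sum(REGION_COST[r] for r in combo))
    raise ValueError("no placement satisfies the SLO")

print(place(slo_ms=50))   # -> ('us-east', 'eu')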

Speaker Gerald Guo, Meta
Speaker Yatpang Cheung, Meta
Speaker Yunqi Zhang, Meta
02:50 PM - 03:05 PM
Model Freshness and Its Infra Implications

Meta's recommendation systems rely on "freshness": the speed at which user interaction signals are ingested, trained on, and utilized. To improve model freshness, Meta developed solutions addressing scaling, serving footprints, and diverse architectures.

Speaker Vivek Khurana, Meta
Speaker Xianzheng Dou, Meta
Speaker Lujia Zhang, Meta
03:05 PM - 03:30 PM
Experience Operating Large GPU Clusters at Organizational Scale

We outline Nvidia's experience managing a large-scale internal GPU compute platform spanning multiple heterogeneous clusters. The platform supports thousands of users and hundreds of project accounts, handling a diverse mix of training and batch inference workloads across various research fields. We focus on three key challenges: researcher productivity, resource utilization, and operational efficiency. To improve researcher productivity, we emphasize fair scheduling and workload resilience. To keep resource utilization high, we discuss strategies to maintain high occupancy. On the operational efficiency front, we highlight our scheduler simulation capabilities, which enable safe testing of changes without affecting production workloads. The presentation concludes with key lessons learned and our vision for future improvements.

Speaker Vikas Mehta, NVIDIA
Speaker Bugra Gedik, NVIDIA
Speaker Mohamed Fawzy, NVIDIA
Speaker Vipin Sirohi, NVIDIA

Track 2

11:05 AM - 11:25 AM
AI Hardware Reliability at Scale

This talk will describe our journey with AI hardware reliability (GPU/silicon) while running large-scale training and inference at Meta. It will highlight our efforts across the ecosystem, covering vendor systems and our own custom silicon work to run AI hardware reliably at scale. For a software/services audience, this will provide an under-the-hood look at how AI hardware reliability impacts AI applications and how Meta is driving the industry forward.

Speaker Sriram Sankar, Meta
Speaker Harish Dixit, Meta
11:25 AM - 11:45 AM
How We’re Scaling Discovery at Netflix Reliably

In the dynamic realm of streaming services, reliability and scalability are imperatives. This talk unveils the sophisticated architecture of Netflix's Member Discovery System, known as Mosaic, which powers key member-facing pages. Discover the innovative strategies that ensure Mosaic's robustness and reliability, and learn how Netflix sets the standard in delivering a seamless user experience to millions worldwide.

We will explore unconventional testing and deployment strategies that address the vast scale of our user base and diverse client devices, alongside sophisticated failover strategies that safeguard service continuity. Join us to uncover how Netflix maintains its competitive edge in delivering a reliable and scalable discovery system, and be inspired by the cutting-edge techniques that power Mosaic.

Speaker Karthik Puthraya, Netflix
Speaker Saurabh Jaluka, Netflix
11:45 AM - 12:45 PM
Lunch
12:45 PM - 01:10 PM
Temporary Solutions, Lasting Impact

The relentless pursuit of more AI is driving us to embrace calculated risks through innovative approaches and to boldly place hardware in “tent”-like structures. How does infrastructure trade off reliability for speed?

Speaker Michael Bejda, Meta
Speaker Prathyusha Peddi, Meta
01:10 PM - 01:40 PM
Live Q&A Session #2
01:40 PM - 01:50 PM
Turbocharging AI/ML workloads: Revving Up Speed and Resilience

Race cars are built for speed and resilience, equipped with cutting-edge features to reach high velocities while maintaining a firm grip on the perilous track. What if we could apply similar features to boost the speed and resilience of AI/ML jobs running over complex networking fabrics?
In this session, we’ll dive into the key networking challenges impacting AI/ML workloads, such as NIC and link flapping, network contention, and congestion. These issues not only slow down job completion times but also eat into ROI by increasing the likelihood of interruptions and costly rollbacks. I’ll demonstrate these challenges in action and introduce a solution that enhances network visibility for AI/ML jobs while ensuring smooth, uninterrupted performance even in the face of link instability or congested paths. By addressing these issues, we can optimize the efficiency of AI/ML jobs, reduce time lost to disruptions, and improve ROI by avoiding the need to revert to previous checkpoints.

Join me to learn how to supercharge your AI/ML workloads for speed and resilience — just like a race car engineered for peak performance!

Speaker Lerna Ekmekcioglu, Clockwork Systems
01:50 PM - 02:05 PM
Splitting the Monolith

For over a decade, Facebook and most other Meta products have been powered by a single monolithic PHP application. Since 2022, we have been investing in a multitenancy framework that allows product specialization with less operational overhead. This has allowed Meta to move fast in generative AI by providing a familiar development environment for our engineers, with optimizations for inference workloads.

Speaker Phil Lopreiato, Meta
Speaker Zach Zundel, Meta
02:05 PM - 02:35 PM
Break
02:35 PM - 02:50 PM
Building KVStore for ML Workloads at Pinterest

This presentation introduces Pinterest's KVStore, a distributed key-value store designed to support machine learning workloads that are central to Pinterest's functions. KVStore enables efficient low-latency ML feature serving with various data update methods. KVStore is crucial for Pinterest's AI/ML-driven platform, evolving to meet business needs with a focus on reliability and efficiency at scale.
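
As a rough illustration of the shape of store the talk describes (low-latency point lookups on the serving path plus batch and streaming update paths), here is a small Python sketch. The class and method names are hypothetical and are not Pinterest's KVStore API.

from typing import Iterable

class FeatureStore:
    def __init__(self) -> None:
        self._data = {}

    def get(self, entity_id: str) -> dict:
        # Low-latency point read used on the model-serving path.
        return self._data.get(entity_id, {})

    def bulk_load(self, rows: Iterable) -> None:
        # Batch path: e.g., a daily offline feature snapshot.
        for entity_id, features in rows:
            self._data[entity_id] = dict(features)

    def upsert(self, entity_id: str, features: dict) -> None:
        # Streaming path: near-real-time feature updates merged into the row.
        self._data.setdefault(entity_id, {}).update(features)

store = FeatureStore()
store.bulk_load([("pin:1", {"ctr_7d": 0.12})])
store.upsert("pin:1", {"ctr_1h": 0.30})
print(store.get("pin:1"))   # {'ctr_7d': 0.12, 'ctr_1h': 0.3}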

Speaker Jia Zhan, Pinterest
02:50 PM - 03:05 PM
Journey to 1000 Models: Scaling Instagram's algorithm without the Reliability Nightmare

At the beginning of 2023, Instagram had O(10) GPU models, a manual release process, and a manual monitoring setup. This talk will be centered around our journey to 1000 models: the bumps along the road and the foundational work built to make monitoring model health faster and more accurate. We’ll go over the model registry, the model launch process, and model stability.

Speaker Sing Sing Ma, Meta
Speaker Luke Levis, Meta
03:05 PM - 03:30 PM
Product Reliability in Google Maps

While our organization excelled at maintaining server SLOs for Google Maps, we discovered that many user-impacting incidents, particularly those stemming from client-side issues like mobile app rollouts, remained undetected by server-centric monitoring. This realization prompted a strategic shift towards product reliability, prioritizing the end-user experience. This talk will discuss how we navigated this transition, sharing our progress in addressing challenges, the valuable lessons learned, and our evolving vision for a holistic, user-focused reliability strategy.

Speaker Micah Lerner, Google
03:30 PM - 04:00 PM
Table Talks
04:00 PM - 05:00 PM
Happy Hour & Networking

SPEAKERS AND MODERATORS

Peter Hoose is the head of Production Engineering at Meta. PE is a unique...

Peter Hoose

Meta

Surupa Biswas is the Vice President of Engineering responsible for Core Infrastructure at Meta,...

Surupa Biswas

Meta

Executive Vice President at Microsoft.

Jay Parikh

Microsoft

Richard has been a Production Engineer at Meta for 13 years, initially working in...

Richard Wareing

Meta

Tyler Graf has been deploying Infrastructure for Meta for 8 years and has played...

Tyler Graf

Meta

Zhenyu has strong experience in building high performance AI/ML infrastructure at scale that cover...

Zhenyu Gu

AMD

Anca is a seasoned software engineer with over 11 years of experience at Meta,...

Anca Agape

Meta

Andrei has two decades of experience designing and building distributed systems across different scales....

Andrei Darabanov

Meta

Haytham Abuelfutuh is co-founder and CTO at Union.ai where he works on pushing the...

Haytham Abuelfutuh

Union.ai

Sumit has been in the storage industry for 30 years and has been in...

Sumit Gupta

Meta

Riley has been building high-performance data paths at Pure Storage for over a decade....

Riley Thomasson

Pure Storage

Gerald is a Research Scientist at Meta, working on building a global service...

Gerald Guo

Meta

Yatpang is an Engineering Manager at Meta where he supports the team responsible for...

Yatpang Cheung

Meta

Yunqi is a Software Engineer at Meta, working on service management challenges for Meta's...

Yunqi Zhang

Meta

Vivek is an Engineering Manager in Core Infra at Meta and has years of...

Vivek Khurana

Meta

Xianzheng Dou is a Research Scientist at Meta, based in Bellevue, USA. He has...

Xianzheng Dou

Meta

Lujia works in AI infrastructure, focusing on inference runtime and model freshness. With experience...

Lujia Zhang

Meta

Vikas Mehta is a Principal Software Engineer at Nvidia, where he focuses on developing...

Vikas Mehta

NVIDIA

Bugra Gedik is a seasoned software engineer with extensive experience in AI infrastructure, specializing...

Bugra Gedik

NVIDIA

Mohamed Fawzy works on building large-scale AI infrastructure to enhance researcher productivity and optimize...

Mohamed Fawzy

NVIDIA

Vipin is a Principal HPC engineer at Nvidia with over a decade of experience in...

Vipin Sirohi

NVIDIA

Sriram Sankar is a Director of Engineering at Meta, leading teams responsible for the...

Sriram Sankar

Meta

Harish leads the efforts on infrastructure silicon reliability strategy, hardware lifecycle efficiency and silent...

Harish Dixit

Meta

Karthik Puthraya is a senior software engineer with over a decade of experience building...

Karthik Puthraya

Netflix

Saurabh is a seasoned Senior Software Engineer with over a decade of experience in...

Saurabh Jaluka

Netflix

Michael has been engineering software on Meta's Core Systems and Infrastructure teams since 2014....

Michael Bejda

Meta

Prathyusha is a passionate software engineer with a remarkable ability to solve problems and...

Prathyusha Peddi

Meta

Lerna is a Senior Solutions Engineer at Clockwork Systems where she helps customers meet...

Lerna Ekmekcioglu

Clockwork Systems

Phil is a Software Engineer at Meta who works on sitewide reliability, scalability, and...

Phil Lopreiato

Meta

Zach is a Production Engineer at Meta who works on platform reliability and incident...

Zach Zundel

Meta

Jia is a senior staff software engineer at Pinterest. He currently leads the NoSQL development...

Jia Zhan

Pinterest

Sing Sing is a Production Engineer on the Instagram Relevance Infra team specializing in...

Sing Sing Ma

Meta

Luke Levis is a Production Engineer with over 8 years of experience at Meta....

Luke Levis

Meta

Micah Lerner is a tech lead at Google focused on user-focused reliability for Google...

Micah Lerner

Google

2025 Events

@Scale is a technical conference series for engineers who build or maintain systems designed for scale. New this year, in-person and virtual attendance options will be available at all four of our programs, each of which brings together complementary themes to create event communities and spark cross-discipline collaboration.

SYSTEMS & RELIABILITY - MAY 6, 2025

Hosted In Person & Virtually
Meta Campus, Menlo Park

The first installment of the 2025 @Scale conference series will combine two of the most foundational topics across the stack: Systems & Reliability. This two-track program will feature technical talks about the demands of AI, organized around the conference theme of "rising to the challenge." The themed talks will include compelling stories from speakers across the industry about solving the hardest hyperscale problems in distributed systems, infrastructure resilience, and many other complex challenges.

Register today and stay tuned for agenda announcements in the coming weeks.

AI & DATA - JUNE 25, 2025

Hosted In Person & Virtually

Register today to be notified when the @Scale conference on AI & Data has been scheduled.

NETWORKING - AUGUST 13, 2025

Hosted In Person & Virtually
Santa Clara Convention Center

In 2025, @Scale: Networking will continue to focus on the evolution of AI Networking. To address the growing complexity of network operations, we will examine a full-stack perspective on debugging, encompassing the communications layer through to the hardware. By adopting a holistic approach across both our front-end and back-end networks, we can identify and mitigate potential bottlenecks, ensuring optimal network performance. You will also hear from industry experts and leading researchers who are at the forefront of building large-scale networks. Attendees will benefit from the opportunity to learn about diverse approaches to solving common challenges and explore potential collaborations.

Register today and stay tuned for upcoming agenda announcements.

PRODUCT - OCTOBER 22, 2025

Hosted In Person & Virtually
Meta Campus, Menlo Park

@Scale: Product is an exciting evolution of the conference series, bringing together the best of Product @Scale, RTC @Scale, Mobile @Scale, and Video @Scale. This comprehensive program is designed for engineers who are passionate about building and optimizing large-scale products. Attendees will gain insights into the latest innovations, best practices, and tools that drive efficiency and performance across product development, real-time communication, mobile platforms, and video technologies.

Register today and stay tuned for upcoming agenda announcements.
