Systems & Reliability 2026

June 25, 2026
Bellevue, Washington

Building the advanced infrastructure necessary to power today's sophisticated AI models represents a monumental engineering challenge. This endeavor demands the creation of highly scalable, high-performance, and supremely reliable computing systems that are not just general-purpose but are meticulously tailored to the unique, often unpredictable, demands of machine learning workloads. This specialized infrastructure includes custom silicon accelerators (like GPUs and TPUs), ultra-high-speed networking fabrics, and novel storage architectures designed for massive data throughput—all optimized for parallel processing at an unprecedented scale.

Simultaneously, the very AI that this infrastructure supports is fundamentally revolutionizing the systems themselves. AI is now being deployed to design, operate, and optimize these large-scale systems, leading to a new generation of intelligent, efficient, and resilient system management. Machine learning algorithms are used for dynamic resource allocation, predictive maintenance to prevent outages, and sophisticated anomaly detection to maintain system health. This AI-driven optimization moves system management from reactive problem-solving to proactive, self-optimizing operation.


EVENT AGENDA

Event times below are displayed in PT.

June 25

08:30 AM - 09:45 AM
Registration
08:30 AM - 09:45 AM
Breakfast, Raffle Submissions, and Networking
09:45 AM - 09:50 AM
Event Welcome
09:50 AM - 10:10 AM
Keynote from Meta

Surupa and Peter will share the latest progress from Meta's mission to build the future of human connection and the technology that makes it possible, including AI infrastructure that leverages agentic innovation across the stack.

Speaker Surupa Biswas, Meta
Speaker Peter Hoose, Meta
10:10 AM - 10:35 AM
To Be Announced
10:35 AM - 10:55 AM
Our Journey to Safely Unleash Agents at Meta Scale

Engineering at Meta is in the midst of a productivity revolution, driven by AI. Our goal is to dramatically increase engineering velocity by using agents to build new systems and operate existing ones. Teams are not just using AI to write code faster; they’re also leveraging AI to debug GPU failures, remediate system outages and more.

To unleash agentic productivity, we have embarked on a journey to safely enable agentic access to Meta’s systems. This talk will dive into what we’ve learned so far: building internal agents, deploying third-party agents, observing agents' interactions with our infra, and deploying guardrails to prevent destructive actions.

Speaker David Pariag, Meta
10:55 AM - 11:15 AM
The Agentic Infrastructure Gap: In-Distribution Languages Make It a Coding Problem

To handle the scale and velocity of AI-written code, we will have no choice but to let AI manage our infrastructure too. Yet Andrej Karpathy recently described getting an app running in production as “assembling IKEA furniture”: cloud consoles, API keys, copy-pasted config, glue ... all things that sit outside the code an LLM can reason about. Frontier models are trained on billions of lines of real languages like Python, TypeScript, and Go, and on vanishingly few bespoke DSLs and manual procedures. If we model infrastructure in code space, so the LLM can do what it does best (write code), all we need is an oracle that maps code changes back to infrastructure outcomes. In this talk, I’ll share what we’ve learned at Pulumi working alongside leading AI companies and frontier labs to build for a world where agents manage infrastructure. The platforms that win in this new era will look different, but many of the human-ergonomic benefits of programming languages are what will get us there.
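The "oracle" idea in the abstract can be made concrete with a toy sketch. The snippet below is purely illustrative (it is not Pulumi's actual engine, and the `plan` function and resource dicts are hypothetical): when infrastructure is modeled as ordinary code-level data, mapping a code change to an infrastructure outcome reduces to diffing desired state against current state.

```python
# Toy desired-state diff: the kind of "oracle" that maps a code change
# (a changed desired-state declaration) to infrastructure outcomes.
# Illustrative only; not Pulumi's actual engine.

def plan(current: dict, desired: dict) -> list[tuple[str, str]]:
    """Return (action, resource) steps that reconcile current -> desired."""
    steps = []
    for name in desired:
        if name not in current:
            steps.append(("create", name))     # new resource in code
        elif current[name] != desired[name]:
            steps.append(("update", name))     # changed properties in code
    for name in current:
        if name not in desired:
            steps.append(("delete", name))     # removed from code
    return steps

current = {"bucket": {"versioning": False}}
desired = {"bucket": {"versioning": True}, "queue": {"fifo": True}}

print(plan(current, desired))  # → [('update', 'bucket'), ('create', 'queue')]
```

Because both sides of the diff live in "code space," an LLM editing the `desired` declaration can reason about the resulting plan without ever touching a cloud console.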

Speaker Joe Duffy, Pulumi
11:15 AM - 11:35 AM
Teaching AI to Fight Fires: Autonomous Reliability Agents at Meta Scale

What if your best oncall engineer never slept, investigated every alert in parallel, and got smarter after every SEV? At Meta, we're building that reality. This talk covers how Production Engineering is deploying AI agents that autonomously investigate, diagnose, and increasingly mitigate production incidents across systems serving billions of users.
We'll deep-dive into RecInvestigator — an AI agent that has investigated 300+ SEVs across Meta's Recommendation Systems, cutting p50 MTDM by 51%. We'll explain how it encodes expert investigation workflows, correlates signals across Service Router, Tupperware, ODS, Scuba, and Presto, and maintains an ~80% match rate with human investigators. We'll then show how we're using AI-driven SEV pattern analysis to move from reactive firefighting to proactive prevention — identifying systemic failure modes, proposing architectural fixes, and prototyping agents that auto-mitigate service regressions during business hours.
The talk will cover the full journey we're on: (1) AI-powered pattern identification across hundreds of SEVs, (2) autonomous investigation agents that match human accuracy, (3) fixer agents that propose and implement reliability improvements in code, and (4) the moonshot — AI-native reliability where systems self-heal and development is reliability-aware by default. We'll share what worked, what didn't, and what we've learned about the trust boundary between autonomous agents and human engineers in production.

Speaker Gaurav Mitra, Meta
11:35 AM - 11:55 AM
Live Q&A Session
Speaker Brendan Burns, Microsoft
Speaker David Pariag, Meta
Speaker Gaurav Mitra, Meta
11:55 AM - 01:15 PM
Lunch & Networking

Track 1 (Center Hall B)

Track 2 (Level 4 Stage)

01:20 PM - 01:40 PM
AI Storage Blueprint

The rapid, exponential growth in model capabilities and training dataset sizes over the last few years has accelerated AI innovation. New frontier models are now being released in a matter of weeks, down from months just a year ago. This pace makes reliable, consistently fast access to storage essential for managing both the speed and cost of development. This presentation will detail how we evolved Meta's storage architecture to overcome two key hurdles: optimizing GPU utilization and maximizing research velocity.

Speaker Sidharth Bajaj, Meta
Speaker Venkatraghavan Srinivasa, Meta
01:50 PM - 02:10 PM
Experience Productionizing and Operating GB200 Clusters

The NVIDIA GB200 NVL72 functions as an exascale computer within a single rack. Maximizing the utilization of its NVLink fabrics through proper scheduling allows AI training jobs to take full advantage of the substantial networking bandwidth this system offers. Notably, recent MLPerf training results (a >2.6x improvement) confirm that GB200 NVL72 provides substantial performance gains across AI workloads. This presentation will cover key topics related to operating and optimizing large GB200 clusters, including:
Topology-aware scheduling within Slurm.
Recommendations for optimal GPU occupancy and scheduling (Slurm simulator).
Our experience bringing up and running a large GB200 cluster (health checks, supporting large-scale training, scheduling, rack testing, slow nodes, etc.).

Speaker Ankur Srivastava, NVIDIA
Speaker Sachin Kumar Lakharia, NVIDIA
02:20 PM - 02:40 PM
Building Privacy Aware Infrastructure in the AI-Native Era

Building privacy-aware infrastructure in the AI-native era requires systems that can learn, evaluate, and operationalize improvements continuously—at the scale of millions of data assets. This talk presents an AI-native approach through a case study of AI-native asset classification: a hybrid deterministic rules engine with LLM fallback that transforms messy context (metadata, lineage, code references, scan signals) into enforceable privacy controls. We’ll highlight ground truth (GT) + evaluation (eval) self-improvement loops, where active learning focuses human review on high-value examples, an LLM-as-Judge provides a scalable review signal (including kappa reliability), and evaluation gates prevent regressions and circular reinforcement. Finally, we’ll show how high-performing LLM behavior is distilled into an auditable rules.yaml, driving LLM usage toward near-zero while improving determinism, latency, and cost. Attendees will leave with practical patterns for building privacy-by-design infrastructure that converges, exports human-readable logic, and monitors drift in production.
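The hybrid pattern the abstract describes can be sketched in a few lines. This is a minimal illustration, not Meta's actual system: the rule predicates, labels, and the `classify_with_llm` stub are hypothetical stand-ins for the real rules engine and model call.

```python
# Hybrid classification sketch: deterministic rules first, LLM fallback
# only for assets the rules can't label. Rules and labels are
# hypothetical examples, not a real privacy taxonomy.

RULES = [  # (predicate over asset metadata, privacy label)
    (lambda a: "email" in a["name"].lower(), "user_contact_info"),
    (lambda a: a.get("source") == "payments", "financial_data"),
]

def classify_with_llm(asset: dict) -> str:
    # Stand-in for an LLM call; a real system would prompt a model with
    # the asset's metadata, lineage, code references, and scan signals.
    return "needs_human_review"

def classify(asset: dict) -> tuple[str, str]:
    """Return (label, path); 'path' records which mechanism decided."""
    for predicate, label in RULES:
        if predicate(asset):
            return label, "rules"            # deterministic, auditable path
    return classify_with_llm(asset), "llm"   # expensive fallback path

print(classify({"name": "user_email_log", "source": "growth"}))
print(classify({"name": "clickstream", "source": "ads"}))
```

Distilling high-performing LLM behavior back into the `RULES` list is what drives LLM usage toward near-zero: every fallback answer that survives evaluation is a candidate for a new deterministic rule.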

Speaker Rituraj Kirti, Meta
02:50 PM - 03:10 PM
Securing Production Debugging at Hyperscale

This presentation will go over how Microsoft uses SSH to debug across our global fleet of K8s clusters at hyperscale. We'll cover how we integrate that with JIT elevations and Task-Based Execution in a secure and auditable way. We will also go over how we use these building blocks to engage AI and the Semantic Kernel to empower our on-call engineers to resolve issues quickly.

Speaker Shridivya Sharma, Microsoft
Speaker Luke Kelly, Microsoft
03:10 PM - 03:35 PM
Transition Back to Track 1
01:20 PM - 01:40 PM
[Almost] Fail & Tell: Stop The World

In March 2025, Meta avoided a major outage that would have resulted in prolonged sitewide unavailability. Instead, we were able to contain the issue in a matter of minutes and avoid the worst-case scenario. This is the story of how we re-bootstrapped the infrastructure control plane in an entire region, on the fly, using emergency tooling. It's a showcase of how to put DR preparedness and calm incident response into practice. And it's a pretty cool story.

Speaker Phil Lopreiato, Meta
Speaker Rahul Iyengar, Meta
01:50 PM - 02:10 PM
Talk from Google

TBD

Speaker Ioannis Papapanagiotou, Google
02:20 PM - 02:40 PM
Taming AI Infrastructure Failures with Agentic Debugging

NCCL watchdog timeouts are a common failure mode in distributed AI model training. They affect not only Meta but anyone running PyTorch distributed training, and they’re notoriously hard to debug: even experts can spend hours triaging a single incident, and non-experts may be unable to root-cause them at all. Over the past year, we investigated NCCL watchdog timeouts internally at Meta and partnered with the PyTorch community to categorize the major root-cause buckets. We then distilled these learnings into a practical decision tree and runbook to speed up triage and make debugging more accessible. We also explored using agent-based approaches to assist root-cause analysis and saw strong early results. (More details TBD.)
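To make the "decision tree" idea concrete, here is a hypothetical sketch of symptom-to-bucket triage. The symptom names and root-cause buckets below are illustrative only; they are not the actual Meta/PyTorch runbook, whose details the talk will cover.

```python
# Hypothetical triage decision tree for NCCL watchdog timeouts.
# Symptom names and buckets are illustrative, not the real runbook.

def triage(symptoms: set[str]) -> str:
    """Map observed symptoms to a coarse root-cause bucket."""
    if "xid_error" in symptoms or "ecc_error" in symptoms:
        return "gpu_hardware_fault"
    if "rank_desync" in symptoms:
        return "collective_mismatch"    # ranks issued different collectives
    if "link_flap" in symptoms:
        return "network_fault"
    if "host_oom" in symptoms or "dataloader_stall" in symptoms:
        return "straggler_rank"         # one rank never reached the collective
    return "unknown_escalate_to_expert"

print(triage({"rank_desync"}))         # → collective_mismatch
print(triage({"dataloader_stall"}))    # → straggler_rank
```

Encoding the tree as code also gives an agent something it can execute and extend, rather than a prose runbook it can only paraphrase.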

Speaker Phillip Liu, Meta
02:50 PM - 03:10 PM
Talk information to be announced
03:10 PM - 03:35 PM
Transition Back to Track 1
03:35 PM - 04:05 PM
Live Panel

Panel information to be announced!

04:05 PM - 04:25 PM
Keynote from Meta

Speaker Peter Hoose, Meta
04:25 PM - 04:30 PM
Closing Remarks
04:30 PM - 06:00 PM
Networking Happy Hour

SPEAKERS AND MODERATORS

Surupa Biswas is the Vice President of Engineering responsible for Core Infrastructure at Meta, ...

Surupa Biswas

Meta

Peter Hoose is the head of Production Engineering at Meta. PE is a unique ...

Peter Hoose

Meta

David Pariag is a software engineer focusing on Meta’s Monitoring Products. He’s spent the ...

David Pariag

Meta

Joe Duffy is Founder and CEO of Pulumi, a venture-backed Seattle company bringing programming ...

Joe Duffy

Pulumi

I am a production engineer at Meta Recommendation Systems (MRS), working on reliability backend ...

Gaurav Mitra

Meta

As a Director of Engineering at Meta, Sidharth focuses on evolving the company's core ...

Sidharth Bajaj

Meta

Venkat is a Software Engineer at Meta currently focused on evolving the Storage stack ...

Venkatraghavan Srinivasa

Meta

Ankur Srivastava is an ML Engineer at NVIDIA specializing in AI performance and efficiency. ...

Ankur Srivastava

NVIDIA

Sachin Lakharia is a Principal Engineer at NVIDIA, where he leads multiple projects focused ...

Sachin Kumar Lakharia

NVIDIA

Rituraj Kirti is a Software Engineer at Meta who builds reusable patterns, a.k.a. ‘recipes’, ...

Rituraj Kirti

Meta

Divya Sharma is a Senior Software Engineer at Microsoft with over a decade of ...

Shridivya Sharma

Microsoft

I am a father of 3 sons and a tech enthusiast whose favorite pastime includes ...

Luke Kelly

Microsoft

Phil is a Software Engineer at Meta who works on sitewide reliability, scalability, and ...

Phil Lopreiato

Meta

Rahul is a production engineer at Meta.

Rahul Iyengar

Meta

Ioannis Papapanagiotou is a Principal Engineer at Google. Ioannis is also a research assistant ...

Ioannis Papapanagiotou

Google

I’ve focused on building and scaling ML-driven product experiences over the past year, with ...

Phillip Liu

Meta