EVENT AGENDA
Event times below are displayed in PT.
June 25, 2026
Bellevue, Washington
Building the advanced infrastructure necessary to power today's sophisticated AI models represents a monumental engineering challenge. This endeavor demands the creation of highly scalable, high-performance, and supremely reliable computing systems that are not just general-purpose but are meticulously tailored to the unique, often unpredictable, demands of machine learning workloads. This specialized infrastructure includes custom silicon accelerators (like GPUs and TPUs), ultra-high-speed networking fabrics, and novel storage architectures designed for massive data throughput—all optimized for parallel processing at an unprecedented scale.
Simultaneously, the very AI that this infrastructure supports is fundamentally revolutionizing the systems themselves. AI is now being deployed to design, operate, and optimize these large-scale systems, leading to a new generation of intelligent, efficient, and resilient system management. Machine learning algorithms are used for dynamic resource allocation, predictive maintenance to prevent outages, and sophisticated anomaly detection to maintain system health. This AI-driven optimization moves system management from reactive problem-solving to proactive, self-optimizing operation.
Surupa and Peter will share the latest progress from Meta's mission to build the future of human connection and the technology that makes it possible, including infrastructure for AI that leverages agentic innovation across the stack.
Engineering at Meta is in the midst of a productivity revolution, driven by AI. Our goal is to dramatically increase engineering velocity by using agents to build new systems and operate existing ones. Teams are not just using AI to write code faster; they’re also leveraging AI to debug GPU failures, remediate system outages and more.
To unleash agentic productivity, we have embarked on a journey to safely enable agentic access to Meta's systems. This talk will dive into what we've learned so far: building internal agents, deploying third-party agents, observing agents' interactions with our infrastructure, and deploying guardrails to prevent destructive actions.
To handle the scale and velocity of AI-written code, we will have no choice but to let AI manage our infrastructure too. Yet Andrej Karpathy recently described getting an app running in production as “assembling IKEA furniture”: cloud consoles, API keys, copy-pasted config, glue, ... all things that sit outside the code an LLM can reason about. Frontier models are trained on billions of lines of real languages like Python, TypeScript, and Go, and on vanishingly little code in bespoke DSLs and manual procedures. By modeling infrastructure in code space so the LLM can do what it does best — code — we just need an oracle that can map code changes back to infrastructure outcomes. In this talk, I'll share what we've learned at Pulumi working alongside leading AI companies and frontier labs to build for a world where agents manage infrastructure. The platforms that win in this new era will look different, but the human-ergonomic benefits of programming languages are what will get us there.
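As a toy illustration of that "oracle" idea (a minimal sketch, not Pulumi's actual engine; all resource names and shapes here are invented), a desired state expressed as plain data can be diffed against the currently deployed state to yield concrete infrastructure actions:

```python
# Minimal sketch of an oracle that maps code-declared desired state to
# infrastructure outcomes by diffing it against the deployed state.
# Resource names and specs are hypothetical examples.

def plan(desired: dict, current: dict) -> list[tuple[str, str]]:
    """Return (action, resource) pairs needed to reach the desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions

desired = {"web": {"instances": 3}, "db": {"size": "large"}}
current = {"web": {"instances": 2}, "cache": {"size": "small"}}
print(plan(desired, current))
# -> [('update', 'web'), ('create', 'db'), ('delete', 'cache')]
```

Because the whole plan is derived from ordinary code and data, a code change an LLM makes translates directly into a reviewable set of infrastructure actions.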
What if your best oncall engineer never slept, investigated every alert in parallel, and got smarter after every SEV? At Meta, we're building that reality. This talk covers how Production Engineering is deploying AI agents that autonomously investigate, diagnose, and increasingly mitigate production incidents across systems serving billions of users.
We'll deep-dive into RecInvestigator — an AI agent that has investigated 300+ SEVs across Meta's Recommendation Systems, cutting p50 MTDM by 51%. We'll explain how it encodes expert investigation workflows, correlates signals across Service Router, Tupperware, ODS, Scuba, and Presto, and maintains an ~80% match rate with human investigators. We'll then show how we're using AI-driven SEV pattern analysis to move from reactive firefighting to proactive prevention — identifying systemic failure modes, proposing architectural fixes, and prototyping agents that auto-mitigate service regressions during business hours.
The talk will cover the full journey we're on: (1) AI-powered pattern identification across hundreds of SEVs, (2) autonomous investigation agents that match human accuracy, (3) fixer agents that propose and implement reliability improvements in code, and (4) the moonshot — AI-native reliability where systems self-heal and development is reliability-aware by default. We'll share what worked, what didn't, and what we've learned about the trust boundary between autonomous agents and human engineers in production.
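A workflow of the kind described above, ordered expert checks over correlated signals, can be sketched in a few lines (a hypothetical illustration; the check names, thresholds, and signal fields are invented and are not RecInvestigator's actual logic):

```python
# Hypothetical sketch of encoding an expert oncall workflow as an ordered
# list of checks over correlated production signals. All names and
# thresholds here are invented for illustration.

CHECKS = [
    ("capacity_loss",  lambda s: s["healthy_hosts"] / s["total_hosts"] < 0.8),
    ("bad_push",       lambda s: s["minutes_since_release"] < 30
                                 and s["error_rate"] > 0.05),
    ("dependency_sev", lambda s: s["upstream_error_rate"] > 0.1),
]

def investigate(signals: dict) -> list[str]:
    """Return the failure hypotheses whose evidence matches, in expert order."""
    return [name for name, check in CHECKS if check(signals)]

signals = {
    "healthy_hosts": 70, "total_hosts": 100,
    "minutes_since_release": 12, "error_rate": 0.09,
    "upstream_error_rate": 0.02,
}
print(investigate(signals))  # -> ['capacity_loss', 'bad_push']
```

Structuring the workflow this way keeps the agent's conclusions comparable against what a human investigator would have checked, which is what makes a match rate measurable at all.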
The rapid, exponential growth in model capabilities and training dataset sizes over the last few years has accelerated AI innovation. New frontier models are now being released in a matter of weeks, down from months just a year ago. This pace makes reliable, consistently fast access to storage essential for managing both the speed and cost of development. This presentation will detail how we evolved Meta's Storage Architecture to overcome two key hurdles: optimizing GPU utilization and maximizing research velocity.
The NVIDIA GB200 NVL72 functions as an exascale computer within a single rack. Maximizing the utilization of its NVLink fabrics through proper scheduling allows AI training jobs to take full advantage of the substantial networking bandwidth this system offers. Notably, recent MLPerf training results (a >2.6x improvement) confirm that GB200 NVL72 provides substantial performance gains across all AI workloads. This presentation will cover key topics related to operating and optimizing large GB200 clusters, including:
Topology-aware scheduling within Slurm.
Recommendations for optimal GPU occupancy and scheduling (Slurm simulator).
Our experience bringing up and running a large GB200 cluster (health checks, supporting large-scale training, scheduling, rack testing, slow nodes, etc.).
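To give a sense of what topology-aware scheduling buys (a minimal sketch of a greedy placement heuristic, not Slurm's actual algorithm; switch names and free-node counts are invented), a scheduler can prefer packing a job's nodes under the fewest leaf switches so traffic stays on the fastest local fabric:

```python
# Illustrative topology-aware placement: greedily pick the leaf switches
# with the most free nodes so a job spans as few switches as possible.
# This is a sketch of the idea, not Slurm's implementation.

def place(job_nodes: int, switches: dict[str, int]) -> list[str]:
    """Return the switches chosen for the job, or [] if it cannot fit."""
    chosen = []
    for switch, free in sorted(switches.items(), key=lambda kv: -kv[1]):
        if job_nodes <= 0:
            break
        chosen.append(switch)
        job_nodes -= free
    return chosen if job_nodes <= 0 else []

racks = {"sw0": 4, "sw1": 18, "sw2": 9}
print(place(18, racks))  # fits entirely under one switch -> ['sw1']
```

An 18-node job lands under a single switch here instead of being scattered across three, which is the same intent behind hints like Slurm's `--switches` option.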
Building privacy-aware infrastructure in the AI-native era requires systems that can learn, evaluate, and operationalize improvements continuously—at the scale of millions of data assets. This talk presents an AI-native approach through a case study of AI-native asset classification: a hybrid deterministic rules engine with LLM fallback that transforms messy context (metadata, lineage, code references, scan signals) into enforceable privacy controls. We’ll highlight ground truth (GT) + evaluation (eval) self-improvement loops, where active learning focuses human review on high-value examples, an LLM-as-Judge provides scalable review signal (including kappa reliability), and evaluation gates prevent regressions and circular reinforcement. Finally, we’ll show how high-performing LLM behavior is distilled into auditable rules.yaml, driving LLM usage toward near-zero while improving determinism, latency, and cost. Attendees will leave with practical patterns for building privacy-by-design infrastructure that converges, exports human-readable logic, and monitors drift in production.
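The hybrid pattern described above can be sketched in a few lines (a hedged illustration; the rule contents, labels, and fallback are invented examples, not the talk's actual rules.yaml): deterministic rules handle the common cases, and only unmatched assets fall back to an LLM.

```python
# Sketch of a hybrid classifier: deterministic rules first (the distilled,
# auditable logic), LLM fallback only for unmatched cases. All rules and
# labels here are invented for illustration.

RULES = [  # (substring to match in asset metadata, privacy label)
    ("ssn", "restricted"),
    ("email", "personal"),
    ("country", "public"),
]

def classify(asset_name: str, llm_fallback) -> tuple[str, str]:
    """Return (label, source); source is 'rules' when deterministic logic matched."""
    name = asset_name.lower()
    for needle, label in RULES:
        if needle in name:
            return label, "rules"
    return llm_fallback(asset_name), "llm"

mock_llm = lambda name: "personal"  # stand-in for the LLM fallback
print(classify("users_email_daily", mock_llm))  # ('personal', 'rules')
print(classify("opaque_blob_t17", mock_llm))    # ('personal', 'llm')
```

Tracking the `source` field is what lets the system measure how much traffic the rules absorb, and therefore drive LLM usage toward near-zero as distillation improves.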
This presentation will go over how Microsoft uses SSH to debug across our global fleet of K8s clusters at hyperscale. We'll cover how we integrate that with JIT elevations and Task-Based Execution in a secure and auditable way, and how we use these building blocks to engage AI and the Semantic Kernel to empower our on-call engineers to resolve issues quickly.
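The JIT-elevation plus task-based-execution pattern can be sketched roughly as follows (a hypothetical illustration only; the task names, TTL, and audit shape are invented, not Microsoft's implementation): no standing access, each action is a predeclared task run under a short-lived grant, and everything leaves an audit record.

```python
# Hypothetical sketch: JIT elevation + task-based execution.
# An engineer requests one predeclared task; the grant auto-expires
# and every request (allowed or denied) is audited.

import time

AUDIT_LOG = []
ALLOWED_TASKS = {"collect-logs", "restart-pod"}  # predeclared tasks only

def run_task(engineer: str, task: str, ttl_s: int = 300) -> bool:
    if task not in ALLOWED_TASKS:
        AUDIT_LOG.append((engineer, task, "denied"))
        return False
    grant_expiry = time.time() + ttl_s  # short-lived JIT elevation
    AUDIT_LOG.append((engineer, task, f"granted until {grant_expiry:.0f}"))
    # ... the real system would execute the task over SSH here ...
    return True

print(run_task("alice", "collect-logs"))   # True
print(run_task("alice", "rm -rf /data"))   # False, and audited
```

Constraining execution to a fixed task vocabulary is also what makes it safe to hand these same building blocks to an AI assistant.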
In March 2025, Meta avoided a major outage that would have resulted in prolonged sitewide unavailability. Instead, we were able to quickly contain the issue in a matter of minutes and avoid the worst-case scenario. This is a story of how we re-bootstrapped the infrastructure control plane in an entire region, on the fly, using emergency tooling. It's a showcase of how to put DR-preparedness and calm incident response into practice. And it's a pretty cool story.
TBD
NCCL watchdog timeouts are a common failure mode in distributed AI model training. They impact not only Meta, but broadly affect anyone running PyTorch distributed training—and they’re notoriously hard to debug: even experts can spend hours triaging a single incident, and non-experts may be unable to root-cause them at all. Over the past year, we investigated NCCL watchdog timeouts internally at Meta and partnered with the PyTorch community to categorize the major root-cause buckets. We then distilled these learnings into a practical decision tree and runbook to speed up triage and make debugging more accessible. We also explored using agent-based approaches to assist root-cause analysis and saw strong early results. (More details TBD.)
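A decision tree of the kind described can be sketched as ordered checks over the available evidence (a simplified, hypothetical illustration; these buckets and signals are invented examples, not the actual Meta/PyTorch runbook):

```python
# Illustrative triage decision tree for an NCCL watchdog timeout.
# The root-cause buckets and evidence fields are simplified examples.

def triage(evidence: dict) -> str:
    if evidence.get("xid_error"):              # GPU/driver fault in kernel logs
        return "hardware: check the node reporting the XID error"
    if evidence.get("straggler_rank") is not None:
        return "straggler: one rank entered the collective late"
    if evidence.get("collective_mismatch"):    # ranks issued different collectives
        return "application bug: desynchronized collective calls"
    return "unknown: capture traces from all ranks and escalate"

print(triage({"straggler_rank": 13}))
# -> 'straggler: one rank entered the collective late'
```

The value of the tree is ordering: cheap, high-signal checks first, so non-experts can rule out whole root-cause buckets before a human expert is paged.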
Panel information to be announced!
Surupa Biswas is the Vice President of Engineering responsible for Core Infrastructure at Meta,...
Peter Hoose is the head of Production Engineering at Meta. PE is a unique...
David Pariag is a software engineer focusing on Meta’s Monitoring Products. He’s spent the...
Joe Duffy is Founder and CEO of Pulumi, a venture-backed Seattle company bringing programming...
I am a production engineer at Meta Recommendation Systems (MRS), working on reliability backend...
As a Director of Engineering at Meta, Sidharth focuses on evolving the company's core...
Venkat is a Software Engineer at Meta currently focused on evolving the Storage stack...
Ankur Srivastava is an ML Engineer at NVIDIA specializing in AI performance and efficiency....
Sachin Lakharia is a Principal Engineer at NVIDIA, where he leads multiple projects focused...
Rituraj Kirti is a Software Engineer at Meta who builds reusable patterns a.k.a ‘recipes’...
Divya Sharma is a Senior Software Engineer at Microsoft with over a decade of...
I am a father of 3 sons; tech enthusiast whose favorite pastime includes...
Phil is a Software Engineer at Meta who works on sitewide reliability, scalability, and...
Rahul is a production engineer at Meta.
Ioannis Papapanagiotou is a Principal Engineer at Google. Ioannis is also a research assistant...
I’ve focused on building and scaling ML-driven product experiences over the past year, with...