EVENT AGENDA
Event times below are displayed in PT.
June 25, 2026
Bellevue, Washington
Building the advanced infrastructure necessary to power today's sophisticated AI models represents a monumental engineering challenge. This endeavor demands the creation of highly scalable, high-performance, and supremely reliable computing systems that are not just general-purpose but are meticulously tailored to the unique, often unpredictable, demands of machine learning workloads. This specialized infrastructure includes custom silicon accelerators (like GPUs and TPUs), ultra-high-speed networking fabrics, and novel storage architectures designed for massive data throughput—all optimized for parallel processing at an unprecedented scale.
Simultaneously, the very AI that this infrastructure supports is fundamentally revolutionizing the systems themselves. AI is now being deployed to design, operate, and optimize these large-scale systems, leading to a new generation of intelligent, efficient, and resilient system management. Machine learning algorithms are used for dynamic resource allocation, predictive maintenance to prevent outages, and sophisticated anomaly detection to maintain system health. This AI-driven optimization moves system management from reactive problem-solving to proactive, self-optimizing operation.
Surupa and Peter will share the latest progress from Meta's mission to build the future of human connection and the technology that makes it possible, including infrastructure for AI that leverages agentic innovation across the stack.
Engineering at Meta is in the midst of a productivity revolution, driven by AI. Our goal is to dramatically increase engineering velocity by using agents to build new systems and operate existing ones. Teams are not just using AI to write code faster; they’re also leveraging AI to debug GPU failures, remediate system outages and more.
To unleash agentic productivity, we have embarked on a journey to safely enable agentic access to Meta's systems. This talk will dive into what we've learned so far: building internal agents, deploying third-party agents, observing agents' interactions with our infrastructure, and deploying guardrails to prevent destructive actions.
To handle the scale and velocity of AI-written code, we will have no choice but to let AI manage our infrastructure too. Yet Andrej Karpathy recently described getting an app running in production as “assembling IKEA furniture”: cloud consoles, API keys, copy-pasted config, glue, ... all things that sit outside the code an LLM can reason about. Frontier models are trained on billions of lines of real languages like Python, TypeScript, and Go, and on vanishingly little code in bespoke DSLs and manual procedures. By modeling infrastructure in code space so the LLM can do what it does best — code — we just need an oracle that can map code changes back to infrastructure outcomes. In this talk, I'll share what we've learned at Pulumi working alongside leading AI companies and frontier labs to build for a world where agents manage infrastructure. The platforms that win in this new era will look different, but the human-ergonomic benefits of programming languages are what will get us there.
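As a toy illustration of that "oracle" idea (a minimal sketch, not Pulumi's actual engine; all resource names and shapes here are invented), a desired state expressed as plain data can be diffed against the currently deployed state to yield concrete infrastructure actions:

```python
# Minimal sketch of an oracle that maps code-declared desired state to
# infrastructure outcomes by diffing it against the deployed state.
# Resource names and specs are hypothetical examples.

def plan(desired: dict, current: dict) -> list[tuple[str, str]]:
    """Return (action, resource) pairs needed to reach the desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return actions

desired = {"web": {"instances": 3}, "db": {"size": "large"}}
current = {"web": {"instances": 2}, "cache": {"size": "small"}}
print(plan(desired, current))
# -> [('update', 'web'), ('create', 'db'), ('delete', 'cache')]
```

Because the whole plan is derived from ordinary code and data, a code change an LLM makes translates directly into a reviewable set of infrastructure actions.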
What if your best oncall engineer never slept, investigated every alert in parallel, and got smarter after every SEV? At Meta, we're building that reality. This talk covers how Production Engineering is deploying AI agents that autonomously investigate, diagnose, and increasingly mitigate production incidents across systems serving billions of users.
We'll deep-dive into RecInvestigator — an AI agent that has investigated 300+ SEVs across Meta's Recommendation Systems, cutting p50 MTDM by 51%. We'll explain how it encodes expert investigation workflows, correlates signals across Service Router, Tupperware, ODS, Scuba, and Presto, and maintains an ~80% match rate with human investigators. We'll then show how we're using AI-driven SEV pattern analysis to move from reactive firefighting to proactive prevention — identifying systemic failure modes, proposing architectural fixes, and prototyping agents that auto-mitigate service regressions during business hours.
The talk will cover the full journey we're on: (1) AI-powered pattern identification across hundreds of SEVs, (2) autonomous investigation agents that match human accuracy, (3) fixer agents that propose and implement reliability improvements in code, and (4) the moonshot — AI-native reliability where systems self-heal and development is reliability-aware by default. We'll share what worked, what didn't, and what we've learned about the trust boundary between autonomous agents and human engineers in production.
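A workflow of the kind described above, ordered expert checks over correlated signals, can be sketched in a few lines (a hypothetical illustration; the check names, thresholds, and signal fields are invented and are not RecInvestigator's actual logic):

```python
# Hypothetical sketch of encoding an expert oncall workflow as an ordered
# list of checks over correlated production signals. All names and
# thresholds here are invented for illustration.

CHECKS = [
    ("capacity_loss",  lambda s: s["healthy_hosts"] / s["total_hosts"] < 0.8),
    ("bad_push",       lambda s: s["minutes_since_release"] < 30
                                 and s["error_rate"] > 0.05),
    ("dependency_sev", lambda s: s["upstream_error_rate"] > 0.1),
]

def investigate(signals: dict) -> list[str]:
    """Return the failure hypotheses whose evidence matches, in expert order."""
    return [name for name, check in CHECKS if check(signals)]

signals = {
    "healthy_hosts": 70, "total_hosts": 100,
    "minutes_since_release": 12, "error_rate": 0.09,
    "upstream_error_rate": 0.02,
}
print(investigate(signals))  # -> ['capacity_loss', 'bad_push']
```

Structuring the workflow this way keeps the agent's conclusions comparable against what a human investigator would have checked, which is what makes a match rate measurable at all.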
The rapid, exponential growth in model capabilities and training dataset sizes over the last few years has accelerated AI innovation. New frontier models are now being released in a matter of weeks, down from months just a year ago. This pace makes reliable, consistently fast access to storage essential for managing both the speed and cost of development. This presentation will detail how we evolved Meta's Storage Architecture to overcome two key hurdles: optimizing GPU utilization and maximizing research velocity.
The NVIDIA GB200 NVL72 functions as an exascale computer within a single rack. Maximizing the utilization of its NVLink fabrics through proper scheduling allows AI training jobs to take full advantage of the substantial networking bandwidth this system offers. Notably, recent MLPerf training results (a >2.6x improvement) confirm that GB200 NVL72 provides substantial performance gains across all AI workloads. This presentation will cover key topics related to operating and optimizing large GB200 clusters, including:
Topology-aware scheduling within Slurm.
Recommendations for optimal GPU occupancy and scheduling (Slurm simulator).
Our experience bringing up and running a large GB200 cluster (health checks, supporting large-scale training, scheduling, rack testing, slow nodes, etc.).
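To give a sense of what topology-aware scheduling buys (a minimal sketch of a greedy placement heuristic, not Slurm's actual algorithm; switch names and free-node counts are invented), a scheduler can prefer packing a job's nodes under the fewest leaf switches so traffic stays on the fastest local fabric:

```python
# Illustrative topology-aware placement: greedily pick the leaf switches
# with the most free nodes so a job spans as few switches as possible.
# This is a sketch of the idea, not Slurm's implementation.

def place(job_nodes: int, switches: dict[str, int]) -> list[str]:
    """Return the switches chosen for the job, or [] if it cannot fit."""
    chosen = []
    for switch, free in sorted(switches.items(), key=lambda kv: -kv[1]):
        if job_nodes <= 0:
            break
        chosen.append(switch)
        job_nodes -= free
    return chosen if job_nodes <= 0 else []

racks = {"sw0": 4, "sw1": 18, "sw2": 9}
print(place(18, racks))  # fits entirely under one switch -> ['sw1']
```

An 18-node job lands under a single switch here instead of being scattered across three, which is the same intent behind hints like Slurm's `--switches` option.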
Building privacy-aware infrastructure in the AI-native era requires systems that can learn, evaluate, and operationalize improvements continuously—at the scale of millions of data assets. This talk presents an AI-native approach through a case study of AI-native asset classification: a hybrid deterministic rules engine with LLM fallback that transforms messy context (metadata, lineage, code references, scan signals) into enforceable privacy controls. We’ll highlight ground truth (GT) + evaluation (eval) self-improvement loops, where active learning focuses human review on high-value examples, an LLM-as-Judge provides scalable review signal (including kappa reliability), and evaluation gates prevent regressions and circular reinforcement. Finally, we’ll show how high-performing LLM behavior is distilled into auditable rules.yaml, driving LLM usage toward near-zero while improving determinism, latency, and cost. Attendees will leave with practical patterns for building privacy-by-design infrastructure that converges, exports human-readable logic, and monitors drift in production.
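The hybrid pattern described above can be sketched in a few lines (a hedged illustration; the rule contents, labels, and fallback are invented examples, not the talk's actual rules.yaml): deterministic rules handle the common cases, and only unmatched assets fall back to an LLM.

```python
# Sketch of a hybrid classifier: deterministic rules first (the distilled,
# auditable logic), LLM fallback only for unmatched cases. All rules and
# labels here are invented for illustration.

RULES = [  # (substring to match in asset metadata, privacy label)
    ("ssn", "restricted"),
    ("email", "personal"),
    ("country", "public"),
]

def classify(asset_name: str, llm_fallback) -> tuple[str, str]:
    """Return (label, source); source is 'rules' when deterministic logic matched."""
    name = asset_name.lower()
    for needle, label in RULES:
        if needle in name:
            return label, "rules"
    return llm_fallback(asset_name), "llm"

mock_llm = lambda name: "personal"  # stand-in for the LLM fallback
print(classify("users_email_daily", mock_llm))  # ('personal', 'rules')
print(classify("opaque_blob_t17", mock_llm))    # ('personal', 'llm')
```

Tracking the `source` field is what lets the system measure how much traffic the rules absorb, and therefore drive LLM usage toward near-zero as distillation improves.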
This presentation will go over how Microsoft uses SSH to debug across our global fleet of K8s clusters at hyperscale. We'll cover how we integrate that with JIT elevations and Task-Based Execution in a secure and auditable way, and how we use these building blocks to engage AI and the Semantic Kernel to empower our on-call engineers to resolve issues quickly.
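The JIT-elevation plus task-based-execution pattern can be sketched roughly as follows (a hypothetical illustration only; the task names, TTL, and audit shape are invented, not Microsoft's implementation): no standing access, each action is a predeclared task run under a short-lived grant, and everything leaves an audit record.

```python
# Hypothetical sketch: JIT elevation + task-based execution.
# An engineer requests one predeclared task; the grant auto-expires
# and every request (allowed or denied) is audited.

import time

AUDIT_LOG = []
ALLOWED_TASKS = {"collect-logs", "restart-pod"}  # predeclared tasks only

def run_task(engineer: str, task: str, ttl_s: int = 300) -> bool:
    if task not in ALLOWED_TASKS:
        AUDIT_LOG.append((engineer, task, "denied"))
        return False
    grant_expiry = time.time() + ttl_s  # short-lived JIT elevation
    AUDIT_LOG.append((engineer, task, f"granted until {grant_expiry:.0f}"))
    # ... the real system would execute the task over SSH here ...
    return True

print(run_task("alice", "collect-logs"))   # True
print(run_task("alice", "rm -rf /data"))   # False, and audited
```

Constraining execution to a fixed task vocabulary is also what makes it safe to hand these same building blocks to an AI assistant.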
In March 2025, Meta avoided a major outage that would have resulted in prolonged sitewide unavailability. Instead, we were able to quickly contain the issue in a matter of minutes and avoid the worst-case scenario. This is a story of how we re-bootstrapped the infrastructure control plane in an entire region, on the fly, using emergency tooling. It's a showcase of how to put DR-preparedness and calm incident response into practice. And it's a pretty cool story.
TBD
NCCL watchdog timeouts are a common failure mode in distributed AI model training. They impact not only Meta, but broadly affect anyone running PyTorch distributed training—and they’re notoriously hard to debug: even experts can spend hours triaging a single incident, and non-experts may be unable to root-cause them at all. Over the past year, we investigated NCCL watchdog timeouts internally at Meta and partnered with the PyTorch community to categorize the major root-cause buckets. We then distilled these learnings into a practical decision tree and runbook to speed up triage and make debugging more accessible. We also explored using agent-based approaches to assist root-cause analysis and saw strong early results. (More details TBD.)
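A decision tree of the kind described can be sketched as ordered checks over the available evidence (a simplified, hypothetical illustration; these buckets and signals are invented examples, not the actual Meta/PyTorch runbook):

```python
# Illustrative triage decision tree for an NCCL watchdog timeout.
# The root-cause buckets and evidence fields are simplified examples.

def triage(evidence: dict) -> str:
    if evidence.get("xid_error"):              # GPU/driver fault in kernel logs
        return "hardware: check the node reporting the XID error"
    if evidence.get("straggler_rank") is not None:
        return "straggler: one rank entered the collective late"
    if evidence.get("collective_mismatch"):    # ranks issued different collectives
        return "application bug: desynchronized collective calls"
    return "unknown: capture traces from all ranks and escalate"

print(triage({"straggler_rank": 13}))
# -> 'straggler: one rank entered the collective late'
```

The value of the tree is ordering: cheap, high-signal checks first, so non-experts can rule out whole root-cause buckets before a human expert is paged.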
Panel information to be announced!
Surupa Biswas is the Vice President of Engineering responsible for Core Infrastructure at Meta,...
Peter Hoose is the head of Production Engineering at Meta. PE is a unique...
David Pariag is a software engineer focusing on Meta’s Monitoring Products. He’s spent the...
Joe Duffy is Founder and CEO of Pulumi, a venture-backed Seattle company bringing programming...
I am a production engineer at Meta Recommendation Systems (MRS), working on reliability backend...
As a Director of Engineering at Meta, Sidharth focuses on evolving the company's core...
Venkat is a Software Engineer at Meta currently focused on evolving the Storage stack...
Ankur Srivastava is an ML Engineer at NVIDIA specializing in AI performance and efficiency....
Sachin Lakharia is a Principal Engineer at NVIDIA, where he leads multiple projects focused...
Rituraj Kirti is a Software Engineer at Meta who builds reusable patterns a.k.a ‘recipes’...
Divya Sharma is a Senior Software Engineer at Microsoft with over a decade of...
I am a father of 3 sons; tech enthusiast whose favorite pastime includes...
Phil is a Software Engineer at Meta who works on sitewide reliability, scalability, and...
Rahul is a production engineer at Meta.
Ioannis Papapanagiotou is a Principal Engineer at Google. Ioannis is also a research assistant...
I’ve focused on building and scaling ML-driven product experiences over the past year, with...