June 25, 2026

Teaching AI to Fight Fires: Building the Reliability Flywheel at Meta

Gaurav Mitra

In December 2024, a configuration change at Meta caused a widespread routing error across every region. Within minutes, BGP (border gateway protocol) processes were crashing across multiple regions. The rollback could not restore a clean state because the routing was already inconsistent. Persistent packet loss spread across every product. We declared a SEV 0.

It took a large cross-team effort, continuous incident management, physical data-center power cycles, and several days of follow-up recovery work to fully restore service. Following industry-standard incident response, we halted ML training, kill-switched deployments, and disabled self-healing systems to prevent further damage.

This is what firefighting looks like at scale.

What if an agent could detect that kind of routing anomaly in seconds, correlate the config change, and propose the rollback before the cascade propagates? In that scenario, the cascade is averted entirely.

In the year since that outage, we’ve been building just that: a system where the best on-call engineer’s expertise is available 24/7, encoding their knowledge and methods into an always-on copilot. The agent handles the mechanical work of correlating signals across systems so the engineer can focus on judgment: what’s actually broken, how impactful is the fix, what’s the blast radius. The engineer’s expertise becomes the fuel; the agent is the engine that carries it forward across every shift, every incident, every service.

Our system learns from each investigation. Evals measure what worked and what didn’t in the agent’s reasoning; the resulting patterns are automatically encoded into domain-specific workflows that make the next investigation faster. A workflow is the encoded version of a senior engineer’s investigation playbook: which signals to check, in what order, and how to interpret them.

We call this loop the reliability flywheel. Below, I’ll describe how we built autonomous investigation agents that have reduced detection-to-mitigation time by 60%.

Why expert knowledge needs to scale

Meta’s infrastructure serves billions of requests across thousands of interconnected services. When something breaks, the investigation surface is enormous: time-series metrics, deployment logs, configuration changes, feature flag rollouts, container health, network topology. A single incident might require correlating signals across several data systems.

The numbers tell the story: thousands of production incidents per year. A median detection-to-mitigation time measured in hours, not minutes. Each investigation demands deep domain knowledge about how specific services behave, what their failure modes look like, and which signals matter most.

The challenge isn’t that engineers lack the capability to handle this. They clearly can and do, often heroically, at 3:00 am on a Saturday with incomplete information and mounting pressure. The challenge is that expert knowledge is the most valuable asset in reliability engineering, and today most of it lives only in the heads of engineers who built it. When your best on-call engineer goes off shift, that expertise goes with them. The next engineer on rotation rediscovers what someone else already learned. The goal is to make the hard-won knowledge persistent and shareable, so every engineer benefits from what every other engineer figured out. 

So we asked ourselves: Can we make expert-investigation knowledge persistent and available even when the expert isn’t on shift? Can we encode what the senior on-call engineer does instinctively (the sequence of checks, the pattern recognition, the “I’ve seen this before” intuition) into a system that any engineer can benefit from at any hour?

The answer turned out to be less about building a smarter AI and more about building a better way to capture and operationalize human expertise.

The reliability flywheel

The reliability flywheel is a reinforcing loop with four stages. Each turn makes the next one faster, and the human expert sits in the center of the flywheel, in the loop rather than on the loop, continuously refining how the system works.

Stage 1: Pattern identification 

Cluster and classify incidents semantically. Instead of treating each incident as an isolated event, the system discovers recurring failure modes that span teams and services. A config regression in one service might share a root-cause pattern with an unrelated outage in another. Humans identify these cross-cutting patterns and teach the agent to recognize them.
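To make the clustering idea concrete, here is a minimal sketch. It is an illustration only: it uses word overlap (Jaccard similarity) as a stand-in for real semantic similarity, and the function names, threshold, and incident summaries are all hypothetical.

```python
def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a semantic embedding."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words the two summaries have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_incidents(summaries: list[str], threshold: float = 0.4) -> list[list[int]]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters: list[list[int]] = []
    for i, summary in enumerate(summaries):
        for cluster in clusters:
            seed = summaries[cluster[0]]
            if jaccard(tokens(summary), tokens(seed)) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # nothing similar yet, start a new cluster
    return clusters


# Made-up incident summaries from three different services.
incidents = [
    "config change caused error spike in ranking service",
    "error spike in feed service after config change",
    "disk full on logging host",
]
clusters = cluster_incidents(incidents)
print(clusters)  # [[0, 1], [2]]
```

The first two incidents come from different services but land in the same cluster, which is the kind of cross-cutting pattern this stage is meant to surface.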

Stage 2: Investigation 

When an incident fires, the agent selects the appropriate expert workflow and begins correlating signals across data systems. It queries metrics, checks recent changes, inspects infrastructure health, and produces ranked root-cause hypotheses with evidence links. The investigation logic reflects what the senior on-call engineer would do, because they’re the ones who encoded it.

Stage 3: Mitigation 

The agent proposes remediation steps. In early stages, this is purely advisory: Here’s what I think happened, and here’s what I’d recommend. As confidence in the agent grows, it begins executing low-risk, reversible actions with human oversight. As trust is earned, the scope of autonomous action expands incrementally.

Stage 4: Self-healing 

After every investigation, the human expert reviews what the agent did. Did it check the right sources? Did it miss a signal? Was the hypothesis correct? The engineer refines the workflow, adds new patterns, adjusts the decision logic. This encoding step is the teaching moment where human expertise flows back into the system. Repeated enough times, it produces self-healing: classes of failure the agent can detect, diagnose, and resolve without human intervention. The flywheel doesn’t just speed up investigations; over time, it eliminates them.

def on_incident(alert):
    workflow = select_expert_workflow(alert)       # playbook encoded by the senior on-call
    evidence = gather_signals(workflow.sources)    # parallel data collection
    hypothesis = reason(evidence, workflow.logic)  # LLM synthesizes
    outcome = propose_mitigation(hypothesis)       # human reviews and decides
    encode_learnings(outcome)                      # human refines for next time

The flywheel’s power comes from the compounding effect. Every incident the agent investigates makes the next investigation faster. But the flywheel only spins because engineers keep teaching it. Humans aren’t adjacent to the system; they’re in the loop, orchestrating it, curating and extending the encoded knowledge that makes the agent useful.

From prompt engineering to context engineering

Early in the project, we spent weeks crafting prompts. We tried to write instructions detailed enough that the LLM would reason correctly about any incident type. It didn’t work. Prompts are fragile; they break when the model encounters a failure mode that wasn’t anticipated in the instruction text.

The breakthrough came when we stopped optimizing what we asked the model and started optimizing what the model could see.

We call this shift “context engineering”: designing the right tools, data sources, and expert workflows so the agent can investigate any incident, rather than writing clever prompts for specific incident types. The model’s job is to reason about what to investigate and synthesize findings. All data comes from verified sources through structured tool interfaces. The LLM never generates metrics, never fabricates log entries, never invents deployment histories.

The architecture reflects this principle. An orchestration layer connects the LLM to a set of tool servers, each exposing a structured interface to a specific data system:

  • time-series databases for metric queries and anomaly detection
  • log analytics for error-pattern search and aggregation
  • configuration trackers for recent change correlation
  • container orchestrators for infrastructure health inspection
  • deployment pipelines for rollout status and canary results

The integration layer is built from four composable primitives:

  • skills that encapsulate investigation techniques
  • plugins that wrap data sources with structured interfaces
  • agent teams that coordinate parallel investigations across service boundaries
  • sub-agents that let domain experts package their expertise as deployable units

Adding a new data source or investigation capability means composing a new skill or plugin, not rewriting prompts or investigation logic. The practical difference is significant. A prompt-engineered agent might say, “Check if there was a recent deployment,” and the quality of that check depends entirely on how well the prompt was written. A context-engineered agent calls the deployment tracker, gets the actual deploy list with timestamps, cross-references them with the metric anomaly window, and presents the correlation with linked evidence.
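A rough sketch of what “wrapping a data source with a structured interface” could look like follows. The names here (ToolResult, ToolRegistry, the deploy_tracker plugin and its rows) are hypothetical and are not the platform’s actual API; the point is only that every result records which system answered and which query was run.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ToolResult:
    source: str   # which data system answered
    query: str    # the exact query that was run
    rows: tuple   # data returned by that system, never written by the model


class ToolRegistry:
    """Maps plugin names to fetch functions that talk to a data system."""

    def __init__(self) -> None:
        self._plugins: dict[str, Callable[[str], Iterable]] = {}

    def register(self, name: str, fetch: Callable[[str], Iterable]) -> None:
        self._plugins[name] = fetch

    def call(self, name: str, query: str) -> ToolResult:
        if name not in self._plugins:
            raise KeyError(f"no plugin registered for {name!r}")
        return ToolResult(source=name, query=query, rows=tuple(self._plugins[name](query)))


# Hypothetical plugin: a fake deployment tracker with one made-up row.
registry = ToolRegistry()
registry.register("deploy_tracker", lambda q: [("deploy-123", "01:05")])

result = registry.call("deploy_tracker", "service=recsys window=2h")
print(result.source, result.rows)
```

Under this shape, adding a data source is one more register call, and the investigation logic that consumes ToolResult objects stays unchanged.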

The engineer’s new role isn’t writing better prompts; it’s orchestrating what the agent sees, which skills are composed, what data sources are available, and what expert workflows guide the investigation sequence. This is a more robust and extensible approach, and it’s where human expertise matters most. Senior on-call engineers don’t need to learn prompt engineering. They need to articulate their investigation process clearly enough that it can be encoded as a reusable skill, and that’s something they already know how to do.

Anatomy of an AI-assisted investigation

Here’s what an agent-assisted investigation looks like in practice, drawn from a real (anonymized) incident.

At 2:00 am on a Saturday, error rates spike on a recommendation service. The on-call engineer’s phone buzzes, but the investigation agent is already working.

Step 1: Alert ingestion

The agent classifies the incoming alert by service, severity, and symptom type. Based on this classification, it selects an expert workflow, the investigation playbook that the senior on-call engineer for this service encoded after handling dozens of similar incidents.
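As a simplified illustration of workflow selection, the lookup below maps a classified alert to a playbook name. The Alert fields mirror the three classification dimensions above; the playbook names and the fallback are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str
    symptom: str


# Hypothetical playbook names keyed by (service, symptom).
WORKFLOWS = {
    ("recsys", "error_rate"): "recsys_error_rate_playbook",
    ("recsys", "latency"): "recsys_latency_playbook",
}
DEFAULT_WORKFLOW = "generic_triage_playbook"


def select_expert_workflow(alert: Alert) -> str:
    """Pick the service-specific playbook, or fall back to generic triage."""
    return WORKFLOWS.get((alert.service, alert.symptom), DEFAULT_WORKFLOW)


alert = Alert(service="recsys", severity="high", symptom="error_rate")
print(select_expert_workflow(alert))  # recsys_error_rate_playbook
```

A real classifier would likely also use severity and richer symptom matching; the fallback matters because a service with no encoded playbook still needs a sensible starting point.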

Step 2: Parallel data gathering

The agent fans out queries to multiple data sources. This is where context engineering pays off: Instead of a sequential, human-paced investigation, the agent exploits parallelism to compress what would take an engineer many minutes of dashboard-hopping into just seconds. A set of sample queries that an agent would make across different types of data sources is shown below:

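import asyncio

async def gather_signals(service):
    # Fan out every query concurrently; each helper wraps one tool server.
    return await asyncio.gather(
        query_error_rates(service, window="30m"),
        check_recent_deploys(service, window="2h"),
        inspect_container_health(service),
        check_config_changes(service, window="6h"),
        check_feature_flags(service, window="24h"),
    )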
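For readers who want to run the fan-out, here is a self-contained sketch with stubbed queries. The stub bodies and their return values are made up, and one stub fails on purpose to show one way of keeping a single unavailable source from sinking the whole investigation.

```python
import asyncio


# Hypothetical stubs standing in for tool-server calls; the values are made up.
async def check_recent_deploys(service: str, window: str) -> list:
    await asyncio.sleep(0.01)  # simulate network latency
    return []  # no deploys in the window


async def inspect_container_health(service: str) -> dict:
    raise TimeoutError("health endpoint timed out")  # simulate a failing source


async def check_config_changes(service: str, window: str) -> list:
    await asyncio.sleep(0.01)
    return [{"key": "hypothetical.key", "changed_at": "01:17"}]


async def gather_signals(service: str) -> dict:
    queries = {
        "recent_deploys": check_recent_deploys(service, window="2h"),
        "container_health": inspect_container_health(service),
        "config_changes": check_config_changes(service, window="6h"),
    }
    # return_exceptions=True hands back a failed query's exception as its result
    # instead of raising it, so the other signals are still delivered.
    results = await asyncio.gather(*queries.values(), return_exceptions=True)
    return dict(zip(queries.keys(), results))


signals = asyncio.run(gather_signals("recsys"))
failed = [name for name, value in signals.items() if isinstance(value, Exception)]
print(failed)  # ['container_health']
```

The agent can then report the failed source as missing evidence rather than silently treating it as healthy.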

Step 3: Hypothesis generation

The LLM synthesizes the gathered signals. In this case, error rates correlate temporally with a configuration change deployed about 45 minutes earlier. Container health is normal. No feature flag changes in the window. No recent code was deployed. The agent produces a ranked hypothesis: “Configuration regression with 85% confidence, based on temporal correlation between config change at 01:17 and error rate inflection at 01:22.”

Every element of this hypothesis links back to a specific tool query and its result. The agent shows its work.
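The temporal-correlation step can be sketched as follows. This is a simplified assumption about how candidates might be ranked, not the agent’s actual scoring: it keeps changes that landed shortly before the error-rate inflection and orders them by how close they were. The 01:17 and 01:22 timestamps come from the example above; the date, the other events, and the 30-minute cutoff are invented.

```python
from datetime import datetime, timedelta


def rank_changes(changes: dict[str, datetime], inflection: datetime,
                 max_lag: timedelta = timedelta(minutes=30)) -> list[tuple[str, timedelta]]:
    """Return (change_id, lag) for changes that landed within max_lag before
    the inflection, nearest first. Changes after the inflection are excluded."""
    candidates = []
    for change_id, landed_at in changes.items():
        lag = inflection - landed_at
        if timedelta(0) <= lag <= max_lag:
            candidates.append((change_id, lag))
    return sorted(candidates, key=lambda item: item[1])


day = datetime(2026, 1, 3)  # arbitrary date for the example
changes = {
    "config_change": day.replace(hour=1, minute=17),
    "old_config_change": day - timedelta(hours=5),  # too early to be a suspect
    "flag_flip": day.replace(hour=1, minute=40),    # after the inflection
}
inflection = day.replace(hour=1, minute=22)

ranked = rank_changes(changes, inflection)
print(ranked)  # [('config_change', datetime.timedelta(seconds=300))]
```

Only the config change survives the filter, five minutes ahead of the inflection, which matches the hypothesis in the example.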

Step 4: Mitigation proposal

The agent proposes a rollback of the specific configuration change, including the exact config key, the previous value, and the command to execute it. The on-call engineer reviews the evidence chain, agrees with the assessment, and approves the rollback.

Step 5: Encoding

The agent logs the investigation. The engineer refines the workflow with what they learned: for this service, config changes are a more frequent root cause than code deployments. The next time this pattern appears, config changes will be checked first, shaving additional seconds off the investigation. 
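One simple way to picture this encoding step is to reorder a workflow’s checks by how often each one turned out to be the root cause. The check names and the history below are invented for the example, and a real workflow would carry far more than an ordering.

```python
from collections import Counter


def reorder_checks(checks: list[str], past_root_causes: list[str]) -> list[str]:
    """Put the most frequent past root causes first. sorted() is stable,
    so checks with equal counts keep their original order."""
    counts = Counter(past_root_causes)
    return sorted(checks, key=lambda check: -counts[check])


checks = ["code_deploy", "config_change", "feature_flag"]
history = ["config_change", "config_change", "code_deploy", "config_change"]  # made up

print(reorder_checks(checks, history))
# ['config_change', 'code_deploy', 'feature_flag']
```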

The on-call engineer’s experience throughout this process is fundamentally different from the traditional model. Instead of spending time correlating dashboards while half-awake, piecing together a timeline from fragmented data, the engineer receives a structured brief: Here’s what changed, here’s the evidence, here’s what I recommend. The engineer’s job, rather than starting from scratch, is to validate and decide, bringing their judgment and contextual knowledge to bear on a well-prepared investigation.

The agent handles the mechanical work of data gathering and correlation. The engineer handles the judgment calls that require understanding of business context, risk tolerance, and the subtle “something feels off” intuition that comes from experience. Together, they’re faster and more reliable than either would be alone.

What we got wrong, and what we learned

Building this system taught us several lessons that we wish we’d known earlier.

Trust is earned through transparency 

The first version of the agent produced confident-sounding recommendations backed by incomplete evidence. Engineers quickly learned to ignore its output. We discovered that trust isn’t about accuracy alone; it’s about auditability. We added a hard constraint: Every conclusion must trace to a specific data query and its result. If the agent can’t cite its source, it stays silent. This single change transformed adoption. Engineers don’t need the agent to be right every time. They need to be able to verify its reasoning quickly.
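The “cite it or stay silent” rule can be sketched as a filter over the agent’s draft conclusions. The Conclusion type, the evidence store, and the sample conclusions are hypothetical; the idea is only that a conclusion is surfaced when every evidence id it cites resolves to a recorded query result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conclusion:
    text: str
    evidence_ids: tuple[str, ...] = ()


def publishable(conclusions: list[Conclusion], evidence_store: dict) -> list[Conclusion]:
    """Keep conclusions that cite at least one result and whose citations all resolve."""
    return [
        c for c in conclusions
        if c.evidence_ids and all(eid in evidence_store for eid in c.evidence_ids)
    ]


# Made-up evidence store: query id -> recorded tool result.
evidence_store = {"q1": {"tool": "config_tracker", "result": "change at 01:17"}}
drafts = [
    Conclusion("Configuration regression", ("q1",)),
    Conclusion("Probably a bad deploy"),   # no citation, so it is dropped
    Conclusion("Disk pressure", ("q9",)),  # cites a query that was never run
]

kept = publishable(drafts, evidence_store)
print([c.text for c in kept])  # ['Configuration regression']
```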

Guardrails are architectural, not instructional 

The agent never generates data. It only queries, reasons, and cites. This isn’t a prompt instruction that the model might ignore; it’s a hard architectural constraint. The LLM has no mechanism to produce a metric value or log entry. It can only invoke tool servers that return real data. When an LLM hallucinates a metric value during a production incident, the consequences can be catastrophic. We designed hallucination out of the system rather than prompting against it.

Autonomy follows a ladder

We didn’t start with auto-mitigation, and trying to skip steps would have destroyed trust. The progression matters:

  • Level 1: Automated audit. The agent scans service health against SLOs and produces reports. Read-only. Engineers evaluate the reports and build confidence in the agent’s observational accuracy.
  • Level 2: Investigation copilot. The agent proposes root-cause hypotheses during active incidents. The human decides what to do. This is where most of the value is delivered today.
  • Level 3: Supervised mitigation. The agent executes low-risk, reversible actions (config rollbacks, traffic shifts) during business hours with human oversight. We’re beginning to expand into this level.
  • Level 4: Self-healing. The agent prevents incidents before they cascade. This is the long-term vision, where the flywheel spins fast enough that many failure modes are caught and resolved before they become incidents at all.

Each level is earned by demonstrating reliability at the previous level. The ladder isn’t just a technical roadmap; it reflects the trust-building process between engineers and the system they’re teaching.
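The ladder can also be read as a permission check. The sketch below is our illustration rather than the production policy: the enum names and the may_execute function are hypothetical, and the Level 4 rule in particular is an assumption, since the text describes that level only as a long-term vision.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    AUDIT = 1                  # read-only reports
    COPILOT = 2                # proposes hypotheses, the human acts
    SUPERVISED_MITIGATION = 3  # low-risk, reversible actions with oversight
    SELF_HEALING = 4           # long-term vision


def may_execute(level: AutonomyLevel, *, low_risk: bool, reversible: bool,
                business_hours: bool, human_approved: bool) -> bool:
    """Return True only when the granted level lets the agent act itself."""
    if level <= AutonomyLevel.COPILOT:
        return False  # levels 1 and 2 never execute actions
    if level == AutonomyLevel.SUPERVISED_MITIGATION:
        return low_risk and reversible and business_hours and human_approved
    # Assumption: Level 4 still requires low-risk, reversible actions.
    return low_risk and reversible


print(may_execute(AutonomyLevel.COPILOT, low_risk=True, reversible=True,
                  business_hours=True, human_approved=True))   # False
print(may_execute(AutonomyLevel.SUPERVISED_MITIGATION, low_risk=True, reversible=True,
                  business_hours=True, human_approved=True))   # True
```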

Measure what matters to the engineer

Early on, we measured, “Did the agent identify the root cause?” But the real question is, “Did the agent help the engineer reach the right answer faster?” An agent that identifies the correct root cause in 10 minutes but presents it in a way that takes 20 minutes to verify hasn’t saved any time. Accuracy without usability is academic. We shifted our metrics to focus on end-to-end time from detection to mitigation, because that’s what the engineer actually cares about.
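For reference, the end-to-end metric is straightforward to compute once each incident has a detection and a mitigation timestamp. The three incidents below are invented purely to show the calculation.

```python
from datetime import datetime
from statistics import median


def median_time_to_mitigate_hours(incidents: list[tuple[datetime, datetime]]) -> float:
    """Median of (mitigated - detected) across incidents, in hours."""
    durations = [(mitigated - detected).total_seconds() / 3600
                 for detected, mitigated in incidents]
    return median(durations)


# Made-up (detected, mitigated) pairs lasting 2, 4, and 9 hours.
incidents = [
    (datetime(2026, 1, 3, 2, 0), datetime(2026, 1, 3, 4, 0)),
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 13, 0)),
    (datetime(2026, 1, 8, 1, 0), datetime(2026, 1, 8, 10, 0)),
]
print(median_time_to_mitigate_hours(incidents))  # 4.0
```

Because the clock runs from detection to mitigation, the time an engineer spends verifying the agent’s output counts against the metric, which is the behavior we wanted to measure.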

Results: Spinning the flywheel

After a year of building, encoding, and refining, the results reflect what the flywheel makes possible.

  • 60% reduction in detection-to-mitigation time, from a median of approximately 9.5 hours to 3.8 hours. This isn’t the agent working alone; it’s the agent and the engineer working together, with the agent handling data gathering and the engineer handling judgment.
  • 1000+ incidents investigated by the agent across multiple service domains.
  • ~80% match rate with human-expert conclusions. When the agent and a senior on-call engineer independently investigate the same incident, their top hypotheses align four out of five times.
  • ~40% correct root-cause identification on the first hypothesis. This number deserves honest context. 40% sounds low until you consider that the agent produces its first hypothesis in minutes, not hours. Even when the first hypothesis is wrong, it gives the engineer a concrete starting point and a body of gathered evidence to work from. And every miss feeds back into the encoding step, making the next investigation more accurate. The flywheel compounds.

Remember the December 2024 outage? An agent watching the routing telemetry would have detected the anomaly within seconds of the config change. It would have correlated the change with the failure in under a minute. The cascade that required a large cross-team effort and days of recovery might instead have been a 15-minute investigation and a targeted rollback.

Our target for 2027: 95% reduction in detection-to-mitigation time, bringing the median under 30 minutes. Not because the agent gets smarter on its own, but because engineers keep encoding better investigation workflows, the pattern library keeps growing, and the flywheel keeps accelerating.
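The arithmetic behind these figures is easy to check. The one assumption in the sketch below is that the 95% target is measured against the same 9.5-hour baseline as the current 60% figure.

```python
baseline_hours = 9.5  # median before the agent, from the results above
current_hours = 3.8   # median today, from the results above

reduction = (baseline_hours - current_hours) / baseline_hours
print(f"current reduction: {reduction:.0%}")  # current reduction: 60%

# Assumption: the 95% target is relative to the same 9.5-hour baseline.
target_minutes = baseline_hours * 60 * (1 - 0.95)
print(f"target median: {target_minutes:.1f} minutes")  # target median: 28.5 minutes
```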

From fighting fires to preventing them

The reliability flywheel extends beyond individual incident response.

The platform effect

The investigation platform is designed so any team can encode their domain expertise as workflow modules. A recommendations team, an ML-training team, an ad-serving team, and a feed-ranking team can each build domain-specific investigation agents on top of a shared orchestration layer, tool-server infrastructure, and context store. The platform handles LLM management, tool integration, and workflow execution. Each domain team focuses on what they know best: their system’s failure modes and how to investigate them.

Pattern analysis at scale

When hundreds of incidents are clustered semantically, systemic failure modes emerge that no single on-call engineer would spot. A semantic-clustering pipeline groups incidents by root-cause similarity, and agents decompose each cluster into sub-patterns, surface cross-service dependencies, and rank prevention investments by blast radius and operational cost. This transforms reactive firefighting into proactive reliability engineering, with humans guiding the analysis and prioritizing the investments.

The engineer’s evolving role

Investigation speed will always matter, but the highest-leverage work is shifting toward encoding investigation patterns that scale across teams and incidents. Engineers who see a novel failure mode and turn it into a reusable workflow multiply their impact: Every other engineer benefits the next time the pattern shows up. The repetitive mechanical work of correlating dashboards at 3:00 am gives way to the creative work of designing investigation strategies and refining the system’s understanding of how production services fail. 

The flywheel keeps spinning: Detect. Investigate. Mitigate. Heal. Repeat. Every turn makes the next one faster. Every encoded workflow makes the next engineer’s shift better. Every pattern discovered makes the next investment decision clearer.

We started by asking: What if expert knowledge was always available, even when the expert wasn’t on shift? The answer turned out to be less about the AI and more about the humans who teach it. The agent is only as good as the expertise encoded into it.

What’s really happening is a shift in abstraction. Engineers used to operate at the level of individual dashboards and log queries. Now they’re operating at the level of investigation strategies and encoded workflows. The mechanical work moves down a layer, and the engineer moves up, spending their time on the harder, more creative problems: designing better investigation patterns, identifying systemic failure modes, and teaching the next generation of agents to handle the cases that today’s agents can’t.

The engineers are still the firefighters. They just work at a higher altitude now, and the knowledge they build today helps every engineer who comes after them.

From fighting fires to preventing them. The flywheel keeps spinning, and the on-calls sleep a little better.  


Gaurav Mitra is a Production Engineer at Meta, where he leads reliability for modern recommendation systems (MRS) powering Facebook, Instagram, and Threads. This post represents the work of several teams across MRS and is a companion to his talk at @scale 2026: Systems & Reliability.
