AI & Data 2026

JUNE 17, 2026 @ 9:15 AM PDT - 5:00 PM PDT

June 17, 2026
Meta Campus, Menlo Park, CA

Meta’s Engineering and Infrastructure teams are excited to bring together a global contingent of engineers who are interested in building, operating, and using AI and data systems at scale.

This year, the conference theme is AI Native Transformation, with a focus on two key areas. Our in-person talks and panels will delve into the latest advancements in Recommender Systems, alongside discussions on the era of Agents, specifically focusing on their orchestration, autonomy, and the transformation they are bringing to engineering and research practice. Attendees can expect to gain practical knowledge and strategies for building AI-powered products, as well as a deeper understanding of the evolving ecosystem.

EVENT AGENDA

Event times below are displayed in PT.

Schedule

Poster Sessions

08:30 AM - 09:45 AM

Attendee Registration

08:30 AM - 09:45 AM

Breakfast & Poster Sessions

09:45 AM - 09:50 AM

Event Welcome

Speaker Faisal Siddiqi, Meta

09:50 AM - 10:15 AM

Keynote from Meta

Speaker Barak Yagour, Meta

10:15 AM - 10:50 AM

Fireside Chat with Boris Cherny, Head of Claude Code

Speaker Boris Cherny, Anthropic

Moderator Jesse Chen, Meta

10:50 AM - 11:10 AM

Data Governance in the World of Agents

AI agents are rapidly moving from demos to production, acting autonomously across tools, data systems, and workflows—and in the process, they amplify data movement far beyond what traditional governance models were designed to handle. Data security controls built for humans break down when agents operate at machine speed, execute in parallel, and persist sensitive information across new data surfaces like trajectories, embeddings, logs, and tool outputs.

In this talk, we outline the emerging data governance failures in agentic architectures: identity confusion for data access, entitlement creep, recursive leakage across agent chains, and new data constructs that render old controls obsolete. We also explain why out-of-the-box agent harnesses and existing IAM are insufficient. We then present Meta’s governance-first approach for safely enabling agents at scale: a defense-in-depth stack centered on Isolation Domains (domain-scoped encryption and output closure), Agent Identity (end-to-end attribution distinct from the user), Agent-Aware Access Control (classification-aware ABAC evaluated at query time), AccessMate (zero-standing-permissions access triage and least-privilege fallback), CodeGuard (secure code generation and runtime execution guardrails), and DataVM, a unified trusted data environment that bounds inputs, tools, and outputs under one governed scope.

Attendees will leave with a concrete reference architecture for building agents that are not merely powerful, but governable, auditable, and regulatory-ready—turning governance from a blocker into the harness that safely unlocks agent autonomy.
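To make the idea of classification-aware ABAC evaluated at query time more concrete, here is a minimal Python sketch. It is not Meta's implementation: the classification levels, the `AgentIdentity` and `Column` types, and the purpose model are all hypothetical names chosen for illustration. The only points taken from the abstract are that the agent carries its own identity, distinct from the user it acts for, and that the decision is made per query against data classifications.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, ordered from least to most restricted.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str        # the agent's own identity, used for attribution
    on_behalf_of: str    # the user who launched the agent
    clearance: str       # highest classification the agent may read
    purposes: frozenset  # purposes the agent is approved for


@dataclass(frozen=True)
class Column:
    name: str
    classification: str
    allowed_purposes: frozenset


def evaluate_query(agent, columns, purpose):
    """Return (allowed, denied) column names for one query, decided at query time."""
    allowed, denied = [], []
    for col in columns:
        within_clearance = LEVELS[col.classification] <= LEVELS[agent.clearance]
        purpose_ok = purpose in agent.purposes and purpose in col.allowed_purposes
        (allowed if within_clearance and purpose_ok else denied).append(col.name)
    return allowed, denied


agent = AgentIdentity("triage-agent-7", "alice", "internal", frozenset({"debugging"}))
columns = [
    Column("request_id", "public", frozenset({"debugging", "analytics"})),
    Column("error_message", "internal", frozenset({"debugging"})),
    Column("email", "restricted", frozenset({"support"})),
]
allowed, denied = evaluate_query(agent, columns, "debugging")
print("allowed:", allowed, "denied:", denied)
```

Because the check runs against the agent's attributes rather than the user's, an agent launched by a highly privileged user does not automatically inherit that user's access in this sketch.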

Speaker Komal Mangtani, Meta

Featured Article

Data Governance in the World of Agents read more

11:10 AM - 11:30 AM

Bridging the Intent Gap in Agentic Systems

Speaker Anoop Deoras, AWS

11:30 AM - 11:55 AM

Why have we not solved security of agents?

The security community spent decades building rules and frameworks that made systems harder to break. AI has fundamentally upended those lessons -- attackers are now more enabled than ever, and traditional defences don't translate. This talk examines prompt injections, indirect prompt injections, and jailbreaks, showing why each resists simple fixes. Drawing on hands-on experience building AI security tools, I'll demonstrate why rules-based approaches fail against systems that interpret natural language as instruction. But there is hope: I'll share defensive approaches that actually work and outline a credible path toward resilient AI systems.

Speaker Ilia Shumailov, Meta

11:55 AM - 12:15 PM

Q&A Session

Moderator Komal Mangtani, Meta

Speaker Anoop Deoras, AWS

Speaker Ilia Shumailov, Meta

12:15 PM - 01:30 PM

Lunch & Poster Sessions

01:30 PM - 02:10 PM

Live Panel: Agentic autonomy & evolution of software & research

Moderator Faisal Siddiqi, Meta

Panelist Henry Eskrine Crum, Meta

Panelist Joe Spisak, Reflection AI

Panelist Jessica Fu, Meta

Panelist Xing Chen, Databricks

Panelist Matt Schlicht, Meta

02:10 PM - 02:30 PM

Tuning Your Algorithm with MRS Memory System and Think-Then-Recommend (TTR)

Users have long wanted to understand and control the algorithms that shape their recommendations, but enabling meaningful user agency over recommendations has remained challenging, until now. We present Tune-Your-Algorithm (TYA), an AI-powered agentic recommendation system on Instagram that gives users transparent visibility into the recommendation algorithm and the ability to tune it using natural language.

TYA is built on two key innovations: (1) the MRS Memory System (Biography), an LLM-based framework that summarizes user engagement histories into rich, structured, and interpretable user interest and intent representations at scale; and (2) Think-Then-Recommend (TTR), a reasoning-augmented approach that decomposes user interests and complex user intents into personalized sub-goals for contextualized recommendations.

Early results show strong product-market fit with positive user feedback on transparency and user agency enabled by TYA. We discuss the end-to-end architecture, production learnings, technical challenges we are actively tackling, and the path towards our north star vision.

Speaker Qi Guo, Meta

02:30 PM - 02:55 PM

Observability: Role of Evals, Benchmarks & Data in Frontier AI

The excitement around agentic AI is real — backed by quantitative progress on model cards and genuine leaps in capability. But our ability to measure AI has been outpaced by our ability to develop it, and closing this evaluation gap is one of the most important problems facing the field. More enduring benchmarks are needed to advance the next vectors of capability and chart the path to reliable agents.

In this talk, Snorkel AI Co-Founder and CEO Alex Ratner will share insights from major research and benchmark collaborations on agentic coding and continual learning, along with practical tips from working with global frontier labs and leading academics. He'll focus on three dimensions where today's models most often break down, and where the next generation of benchmarks will need to deliver real signal: environment complexity (how dynamic and rich the operating world is), autonomy horizon (how far an agent can act independently), and output complexity (how sophisticated and verifiable the deliverable is).

Speaker Alex Ratner, Snorkel AI

02:55 PM - 03:20 PM

How Meta Scaled AI Training Storage via Data Normalization

Training data for Meta's recommendation systems was entirely stored in Data Warehouse, structured as relational tables where each row captures labels and snapshotted features at the point of recommendation.

New modeling techniques, such as learning from user sequences and multi-modality, have led to a 10-100x increase in feature size, making the training data increasingly cost-prohibitive due to high duplication. The same user's features are stored repeatedly for every recommendation request, with highly popular content features being duplicated potentially over a million times.

We present a co-designed data and infrastructure solution to address this scaling challenge. By moving features out of training samples into high-performance indexing storage and implementing pushdown optimizations that are aware of model access patterns, we have achieved a 10x storage cost reduction for the largest feature: long user sequences.
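The normalization idea can be illustrated with a toy Python sketch. It assumes, for simplicity, that a user's feature is identical across that user's requests, so it can be stored once and joined back at read time; a real system would also need to handle snapshot versions. The field names (`user_id`, `user_sequence`) and the in-memory dictionary standing in for the indexing storage are assumptions, not details from the talk.

```python
import json

# Denormalized samples: every row repeats the full (hypothetical) user sequence.
user_sequence = list(range(1000))
denormalized = [
    {"request_id": i, "user_id": "u1", "label": i % 2, "user_sequence": user_sequence}
    for i in range(100)
]


def normalize(samples, key="user_id", feature="user_sequence"):
    """Split samples into slim rows plus a feature store keyed by `key`."""
    store, rows = {}, []
    for s in samples:
        store.setdefault(s[key], s[feature])  # keep each entity's feature once
        rows.append({k: v for k, v in s.items() if k != feature})
    return rows, store


def hydrate(row, store, key="user_id", feature="user_sequence"):
    """Rebuild the full training sample at read time by joining on the key."""
    return {**row, feature: store[row[key]]}


rows, store = normalize(denormalized)
before = len(json.dumps(denormalized))
after = len(json.dumps(rows)) + len(json.dumps(store))
print(f"serialized size: {before} -> {after} bytes")
```

The trade-off in this sketch is the usual one for normalization: storage shrinks because the large feature is stored once, while reads pay for a lookup, which is why the abstract pairs the change with a high-performance index.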

Speaker Sarang Masti, Meta

Speaker Weiran Liu, Meta

03:20 PM - 03:40 PM

Break

03:40 PM - 04:00 PM

Architecting Infrastructure for the AI Native Future: Scaling Autonomous Agents on Google TPUs

As the industry pivots toward an "AI Native" paradigm, the bottleneck for innovation has shifted from algorithmic design to the underlying infrastructure's ability to handle unprecedented scale and complexity. This session explores how Google TPU (Tensor Processing Unit) infrastructure serves as the catalyst for this transformation, specifically within the domains of large-scale Recommender Systems, MoEs, LLMs and the emerging era of Autonomous Agents. We will delve into the architectural innovations of the latest TPU generations, demonstrating how their purpose-built design facilitates the massive throughput required for real-time recommendation engines and the high-speed inference necessary for agentic orchestration.

Speaker Sabastian Mugazambi, Google

04:00 PM - 04:25 PM

Agentic Data at Scale: Transforming Data Experiences at Meta

Every major business decision is ultimately built on data, yet getting to the right answers has long required specialized expertise, from knowing which tables to query to how to query them to building the right data applications. AI agents are fundamentally changing that equation, making data accessible to anyone who can ask a question in plain language. At Meta's scale, with millions of datasets serving tens of thousands of decision-makers, this shift creates both massive opportunity and unique challenges around trust and accuracy. This presentation will detail how we built AI-native data experiences to address two key dimensions: enabling trusted answers through agentic data consumption, and letting users create shareable agentic data applications without writing a single query.

Speaker Dinkar Pataballa, Meta

04:25 PM - 04:30 PM

Closing Remarks

04:30 PM - 06:00 PM

Happy Hour & Poster Sessions

Agent Memory at Meta Scale: One Query Language for Graph, Vector, and Memory

Authors: Abdullah Ozturk, Yiting Li, Vishal Gandhi, Ming Chen, Junjie Qi

Agent memory is the new scaling axis for AI assistants. Useful agents accumulate episodic conversations, semantic patterns, working state, and procedural skills — and every retrieval composes vector similarity, graph traversal, and structured filters over them. Today this lives across fragmented APIs (vector indexes, graph services, RAG frameworks, per-product memory stores), forcing every agent stack to stitch three or four services together on the critical path of every turn.

We present the Cypher Query Layer (CQL) — an AI-native, declarative query surface for agent memory built on the openCypher / GQL standard (ISO/IEC 39075:2024). CQL collapses hybrid retrieval (vector + text + graph + scope + freshness + provenance) into a single optimizable query. A pluggable backend architecture routes each query to its right-fit substrate — in-process Velox for sub-50 ms agent-loop reads — while reusing Meta's existing storage (graph, vector, hybrid). On the roadmap: composable procedure packs for memory management (distillation, consolidation, governance), and a distributed MPP backend for corpus-scale retrieval.

Unlike per-product memory SDKs, CQL treats memory as a query surface rather than an API: graph, vector, and structured filters compose in one plan, with identity-aware ACLs as first-class primitives.

We share the architecture, production patterns from an early agent-platform consumer at Meta, the design we are pursuing for bi-temporal memory to enable agent-debugging replay, and where the in-process vs distributed boundary actually falls in practice.

Agentic Data Flywheel: A Multi-Agent System for Scalable Multimodal Training Data Production

Authors: Wendy Jiang, Selahattin Akkas, Alan Li, Antoine Simoulin

We present an agentic data flywheel that automates the production of high-quality multimodal training data at scale. The system orchestrates multiple specialized VLM agents - a labeler, parallel quality reviewers with a summarizer, and a self-correcting adjuster - to automatically annotate, review, and fix image-text annotations with minimal human intervention. When agents disagree, images and labels are escalated to human review; when the adjuster agrees with corrections, fixes are applied automatically.

The multi-agent system wraps probabilistic VLM reasoning inside deterministic scaffolding: a fine-tuned specialist VLM produces structured annotations; an ensemble of concurrent evaluators independently assesses grounding quality, text accuracy, and semantic classification quality; and a judge powered by a thinking model arbitrates disagreements - overruling false positives, diagnosing root causes, or applying targeted fixes to salvage labels that would otherwise require human intervention. Key design principles include agent isolation to prevent confirmation bias, deterministic routing around probabilistic judgement, and structured output constraints that force closed-choice decisions with every failure mode defaulting to human escalation.
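A small Python sketch can show what deterministic routing around probabilistic judgement with closed-choice outputs might look like. The `Verdict` and `Route` enums and the specific routing rules are assumptions made for illustration; the abstract states only that outputs are constrained to closed choices and that every failure mode defaults to human escalation.

```python
from enum import Enum


class Verdict(Enum):  # closed-choice output every reviewer must return
    PASS = "pass"
    FAIL = "fail"


class Route(Enum):
    AUTO_ACCEPT = "auto_accept"
    AUTO_FIX = "auto_fix"
    HUMAN_REVIEW = "human_review"


def parse_verdict(raw):
    """Map a raw model output to a Verdict; anything unexpected becomes None."""
    try:
        return Verdict(str(raw).strip().lower())
    except ValueError:
        return None


def route(reviewer_outputs, adjuster_agrees_with_fix=None):
    """Deterministic routing around probabilistic reviewer outputs."""
    verdicts = [parse_verdict(r) for r in reviewer_outputs]
    if not verdicts or any(v is None for v in verdicts):
        return Route.HUMAN_REVIEW  # missing or malformed output: escalate
    if all(v is Verdict.PASS for v in verdicts):
        return Route.AUTO_ACCEPT
    if all(v is Verdict.FAIL for v in verdicts) and adjuster_agrees_with_fix is True:
        return Route.AUTO_FIX
    return Route.HUMAN_REVIEW  # disagreement or unresolved failure


print(route(["pass", "PASS "]))
print(route(["pass", "fail"]))
print(route(["fail", "fail"], adjuster_agrees_with_fix=True))
```

The routing function contains no model calls, so the same reviewer outputs always produce the same route, and the final `return` makes human review the default for any case the explicit rules do not cover.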

The system delivers a 3X efficiency gain: 60%+ of annotations pass with no human involvement, and remaining cases are routed to human annotators prefilled with agent corrections and targeted hints - transforming the task from annotation to verification and improving labeling quality. A key enabling capability is multilingual bootstrapping: with multilingual VLMs, the flywheel enables scaling annotation across 100+ languages, transforming what was a per-language staffing bottleneck into a prompt configuration change.

We demonstrate the end-to-end system on an image text labeling task and discuss generalization to scene understanding, chart understanding, document VQA, and grounding & localization.

Agentic Inference Optimization for RecSys ML models

Authors: Shuyao Bi

Modern recommendation models built in PyTorch require multi-stage post-training optimization to meet production inference latency targets. Kernel fusion that replaces sequences of individual operations with a single fused kernel via Torch FX graph transformations is one of the most impactful techniques, reducing kernel launch overhead and memory transfer costs to deliver measurable throughput gains.

Traditionally, identifying fusion opportunities and authoring correct graph transformations has required deep expert knowledge and ~40 hours of manual iteration per model. Transformation choices are fragile to model updates and quickly become stale, creating a scalability bottleneck as model architectures evolve at increasing velocity.

We present an automated optimization agent that replaces this manual process: it detects fusion opportunities from GPU traces and FX graph structure, generates and validates graph transformations — all orchestrated through an iterative, self-converging workflow.

ASEA: Autonomous Agents for ML Serving Efficiency at Scale

Authors: Bikash Sharma, Aashik Gowda, Angineh Keshishian, Mia Gu, Michael Fulthorp, Jinfu Leng

Large-scale ML serving at Meta runs hundreds of models across CPUs, GPUs, and MTIA, consuming hundreds of megawatts. Service overhead — the gap between theoretical minimum and actual resource usage — often reaches 50% of total footprint, yet the optimization loop remains human-bottlenecked: engineers manually stitch together telemetry, profiling, and capacity data in a multi-day workflow that leaves millions of dollars of efficiency on the table every quarter.

We introduce ASEA (Autonomous Agentic Platform for Large-Scale ML Serving), an AI-powered agentic platform with dozens of specialized LLM agents that autonomously orchestrates the full lifecycle of Meta's ML serving infrastructure optimization — from opportunity discovery through root-cause diagnosis, configuration generation, canary validation, and production rollout.

Each agent owns a single skill (capacity sizing, autoscaling, batching, routing, hardware migration, regression detection, etc.) and shares a unified data foundation spanning the fleet. In production, ASEA has delivered tens of millions of dollars in validated efficiency savings and turned a multi-day, expert-only workflow into an autonomous, fleet-wide one.

Cache the Session, Not the Data: 40x Faster ML Iteration with a Shared Sidecar at Meta

Authors: Rohil Bansal, Eric Fu

ML training iteration on developer workstations pays a heavy data preprocessing tax. Each launch of a model spins up its own data preprocessing client, which has to establish a session with the backend, run query optimization over the user's preprocessing plan, perform metadata lookups to translate logical queries into physical data locations, and warm up its readers. In our environment this is on the order of 5+ minutes per launch. For ML engineers iterating on trainer changes, this cold-start cost dominates the inner loop and discourages tight iteration cycles.

We present the Shared Sidecar (SSC): an out-of-process daemon that caches warmed preprocessing sessions across local runs on a developer workstation. Sessions are keyed on a hash of the input dataframe and the preprocessing plan; subsequent runs whose hash matches reuse a pre-warmed session and skip the entire startup path. The sidecar handles session lifecycle (8h idle eviction, 1h whole-process timeout), bounded memory through a warmed-session cap, and graceful degradation when the daemon is unavailable.
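The caching scheme described above can be sketched in a few lines of Python. This is an in-process toy, not the out-of-process daemon: `SessionCache`, `session_key`, the `warm_up` callback, and the use of a dataframe identifier in place of the dataframe itself are all hypothetical. It models the hash-based key, the 8h idle eviction, and a cap on warmed sessions; the 1h whole-process timeout and the fallback path are left out.

```python
import hashlib
import json
import time
from collections import OrderedDict

IDLE_TTL_SECONDS = 8 * 60 * 60  # 8h idle eviction, as stated above


def session_key(dataframe_id, plan):
    """Hash the input dataframe identifier together with the preprocessing plan."""
    payload = json.dumps({"df": dataframe_id, "plan": plan}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class SessionCache:
    def __init__(self, max_sessions=4, clock=time.monotonic):
        self._sessions = OrderedDict()  # key -> (session, last_used)
        self._max = max_sessions        # warmed-session cap bounds memory
        self._clock = clock

    def get_or_create(self, dataframe_id, plan, warm_up):
        now = self._clock()
        # Drop sessions that have been idle longer than the TTL.
        stale = [k for k, (_, t) in self._sessions.items() if now - t > IDLE_TTL_SECONDS]
        for k in stale:
            del self._sessions[k]
        key = session_key(dataframe_id, plan)
        if key in self._sessions:
            session, _ = self._sessions.pop(key)    # hit: skip the startup path
        else:
            session = warm_up(dataframe_id, plan)   # miss: pay the cold start once
            if len(self._sessions) >= self._max:
                self._sessions.popitem(last=False)  # evict least recently used
        self._sessions[key] = (session, now)        # most recently used goes last
        return session
```

Serializing the plan with `sort_keys=True` keeps the key stable when the same plan is written with its fields in a different order, which matters because any spurious key change turns a fast repeat launch back into a cold start.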

SSC has been rolled out to all of our developer workstations, enabled by default, with no correctness regressions. On repeat launches of the same configuration we measure roughly 40× speedup (benchmark: a click-through-rate model dropped from 391s to 9s).

Citrine: Catching Million-Dollar ML Anti-Patterns Before They Ship

Authors: Prakash KL, Chenguang Zhu, Shyam Sundar Chandrasekaran, Jon Dyer

Ensuring ML code efficiency is a multi-million-dollar problem at Meta's scale. Common PyTorch (https://pytorch.org) anti-patterns such as unpinned DataLoader memory, redundant host-to-device transfers, and suboptimal optimizer configurations silently degrade training performance across thousands of jobs. Triton GPU kernel code exhibits analogous pathologies at the kernel layer, including flattened loops that forgo proper tiling. A single one-line misconfiguration can cost thousands of dollars per trainer; replicated fleet-wide — and across the broader AI industry, where hyperscalers collectively spend tens of billions of dollars annually on GPU compute and draw megawatts of grid power — the waste compounds into massive losses of capital and energy. As AI coding agents now author a growing share of production ML code, these inefficiencies are being reproduced at machine speed and at unprecedented volume, making detection at the code-authoring phase the only scalable defense.

We present Citrine, an always-on, zero-overhead efficiency system that detects 45+ ML efficiency anti-patterns spanning both PyTorch core and Triton kernel code through static analysis and AST transformation built on top of LibCST (https://github.com/Instagram/LibCST), and is integrated directly into Meta's arc lint pipeline. Every detector ships with a deterministic AST rewriter that surfaces a one-click suggested edit on every diff under review. In the past 90 days, Citrine has landed 4,700+ accepted fixes across 10+ product groups — including Generative AI, Reality Labs, Instagram, Monetization and Ads — saving millions of dollars annually in GPU compute waste and yielding measured improvements of up to 43% on affected Triton kernels.
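As a rough illustration of static detection for one of the anti-patterns named above (unpinned DataLoader memory), the sketch below uses Python's standard-library `ast` module instead of LibCST, and only detects; it does not rewrite. The function name `find_unpinned_dataloaders` is hypothetical and the rule is deliberately simple: any `DataLoader(...)` call without a literal `pin_memory=True` is flagged, so a non-literal value would be reported as well.

```python
import ast


def find_unpinned_dataloaders(source):
    """Return line numbers of DataLoader(...) calls without pin_memory=True."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `DataLoader(...)` and `torch.utils.data.DataLoader(...)`.
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name != "DataLoader":
            continue
        pinned = any(
            kw.arg == "pin_memory"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is True
            for kw in node.keywords
        )
        if not pinned:
            findings.append(node.lineno)
    return sorted(findings)


SAMPLE = '''
from torch.utils.data import DataLoader
train = DataLoader(dataset, batch_size=32)
val = DataLoader(dataset, batch_size=32, pin_memory=True)
'''
print(find_unpinned_dataloaders(SAMPLE))
```

The sample is only parsed, never executed, so the sketch runs without PyTorch installed. A production linter would also need the rewriter half, which is where a concrete syntax tree such as LibCST helps, since it preserves formatting and comments when emitting the suggested edit.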

Beyond Meta-internal impact, Citrine has become the de facto home for TorchFix (https://github.com/pytorch/torchfix), the open-source PyTorch hygiene project: 13 TorchFix rules — covering deprecated-symbol migrations, the unsafe torch.load deserialization vector, common API typos, and TorchVision migrations — now ship as first-class Citrine patterns, giving developers a single integration point for PyTorch-ecosystem hygiene alongside specific vetted efficiency rules. Through partnership with the open-source Triton compiler team (https://github.com/triton-lang/triton), Citrine also integrates 12+ Triton-lint rules detecting kernel-level pathologies such as missing @triton.autotune decorators, accumulator-precision regressions, warp-divergent control flow, and barrier deadlocks — unifying static analysis across the high-level PyTorch layer and the low-level GPU kernel layer in a single linter.

To address the rapid growth of agent-authored ML code, we further transformed Citrine from a reactive lint tool into a proactive efficiency system by shifting left into the model-authoring workflow. By encoding anti-pattern knowledge directly into Meta's LLM code-authoring framework, PyTorch-specific efficiency rules are injected into AI coding agents at code-generation time, ensuring that both experimental and production code is free of canonical inefficiencies before it is ever written. We present this end-to-end architecture in the context of the software development lifecycle (agentic codegen → linting → CI/Diff → ship), together with an attribution methodology that connects individual lint fixes — both human and AI-authored — to validated efficiency wins at fleet scale.

The approach Citrine pioneers — uniting static analysis with AST-driven remediation at the moment of authorship and reinforcing that knowledge in the LLM-codegen layer — generalizes well beyond Meta. As AI training infrastructure expands toward multi-gigawatt scales and agentic development becomes the standard mode of ML engineering, code-time efficiency enforcement is, in our view, a necessary layer in any sustainable AI compute stack. We continue to upstream generalizable rules through TorchFix and the Triton-lint project so the broader ecosystem benefits. We are extending Citrine in three directions: (1) coverage of emerging hardware and software stacks beyond PyTorch and Triton; (2) tighter integration with LLM coding agents so efficiency knowledge evolves alongside model capabilities; and (3) publishing the attribution methodology so independent researchers and other infrastructure teams can replicate fleet-scale efficiency measurement.

Designing for Human Agency in AI-Augmented Software: A Quality Attribute Approach

Authors: Rohan Vardhan

As artificial intelligence becomes embedded in everyday software, a quiet architectural assumption has taken hold: that human-centeredness is a design philosophy, not a software quality attribute. This article challenges that assumption. Drawing on practitioner experience across multiple organizations and recent research in human-AI interaction, requirements engineering, and software quality models, we argue that existing frameworks, including ISO/IEC 25010:2023, do not adequately capture the properties that make AI-augmented software supportive of human judgment rather than corrosive to it. We propose four quality attributes specific to human-AI systems: Agency Preservation, Calibrated Trust, Contestability, and Learnability of Limits. Together these form the ACCL framework. For each attribute, we present architectural implications and organizational patterns grounded in field observation. We also identify three practitioner-observed failure modes not yet captured in the literature: reflexive bypass, trust debt, and the right-sized explanation problem.

Evaluating Opsmate: Grading AI Incident Investigations at Scale

Authors: Chinmay Gandhi, Narayanan Sankaran, Khushbu Thakur, Ankit Agarwal, Vaidyanathan PK, KC Balsu, Akash Jothi

Opsmate is an autonomous agent that investigates ~1.5k incidents per day at Meta. Getting agents to run is easy; knowing whether they did a good job is hard. Evaluating open-ended agent behavior on real production incidents is fundamentally difficult: ground truth is expensive to curate, correctness is subjective and multi-dimensional, and agents can inadvertently introduce data leaks.

Opsmate Evals addresses these challenges with an orchestrator- and model-agnostic framework that measures and validates the quality of autonomous incident investigation agents at Meta. The framework combines automated ground truth synthesis from post-incident artifacts (chat transcripts, fix diffs, and retro docs) with LLM-as-judge graders that score investigations on a calibrated 0-100 scale across root cause accuracy, investigation path logic, and mitigation actionability.

Leading warehouse autonomy with trust

Authors: Sagie Gur-Ari

Meta operates one of the world's largest data warehouses -- millions of tables, tens of thousands of pipelines, and a constantly shifting capacity surface no human team can supervise end to end. We are transforming warehouse operations from reactive alerting to autonomous resolution: an agent control plane that detects, diagnoses, and acts on pipeline and capacity issues at scale rather than just routing them.

Trust is the foundation, not a feature. Every autonomous action is bounded by transparent guardrails, full audit history, provenance on every decision, and human-in-the-loop controls calibrated to stakes. Oversight scales with autonomy -- users always retain visibility and the ability to intervene as the system's authority grows.

LP Planner: Synchronization-Aware Sharding Optimization for Large-Scale DLRM Model Training

Authors: Mine Su Erturk, Greg Macnamara, Caner Gocmen, Isuru Janith Ranawaka, Felicity Liao, Ahmed Shuaibi, Alireza Tehrani, Hammad Ather, Shafeeq Ibraheem, Kaustubh Vartak, Nipun Gupta

LP Planner is a novel optimization tool that generates sharding plans for Deep Learning Recommendation Models (DLRMs) to enable more efficient distributed training. LP Planner formulates the sharding problem as a linear program, jointly optimizing table placement and partitioning strategies across devices and device memory hierarchies to minimize peak memory usage and maximize training throughput.

LP Planner has been deployed to several production models across various PGs, including a few of the largest models by training spend. In production, LP Planner delivers up to 8% QPS improvement and 14% peak memory reduction per model launch.

Beyond direct model launches, LP Planner has been integrated into various agentic auto-tuning pipelines saving significant engineer-hours. Ongoing work includes workload-specific critical path modeling, shard type selection, and heterogeneous hardware awareness.

Meta-ML: Supercharging ML Productivity at Scale with AI-Native Infrastructure

Authors: Niv Taiber, Bar Ulman, Ravali Busetty, Aasim Rab, Brant Swidler

Meta operates one of the largest ML Infrastructure stacks, with thousands of Machine Learning Engineers (MLEs) across legacy systems built over years. These engineers navigate a complex landscape of legacy services spanning the full ML lifecycle—from data discovery and feature engineering through model training, experiment tracking, serving, and capacity planning—where frequent context switching introduces significant operational overhead.

In this poster, we'll share how we made this legacy ML infrastructure stack agent-callable at scale without rebuilding it, through Meta-ML—an AI-native infrastructure layer that provides programmatic access through specialized CLI tools and composable skills. We present:

* Progressive Tool Disclosure Architecture: How agents discover relevant capabilities on demand rather than loading all 200+ tools upfront, reducing context window pressure while maintaining broad infra coverage
* Federated Domain-Team Ownership: How 15+ infrastructure teams expose their systems as agent-callable tools, embedding domain expertise directly into the agent interface, and how tool coverage is driven by a systematic mapping of MLE jobs-to-be-done across each stage of the ML lifecycle
* Production-Grounded Flywheel: How we instrument real MLE sessions to extract evaluation cases grounded in actual workflows, capturing tool sequences, failure modes, and resolution paths that synthetic benchmarks miss, and use them to continuously improve our infrastructure
* Impact at Scale: Meta-ML now has 200+ tools, 7k+ installs, CLI WAU is >8k, 1,600+ WAU with ~59% adoption among power users generating 4M+ tool calls/week; debugging workflows reduced from 1.5 hours to under 10 minutes
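Progressive tool disclosure, the first item in the list above, can be sketched as a two-step lookup: the agent first sees only short summaries of matching tools, then loads the full schema for the one it picks. The `Tool` and `ToolRegistry` types, the keyword scoring, and the example tool names below are hypothetical and stand in for whatever discovery mechanism Meta-ML actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str  # one-line description, cheap to show the agent
    schema: dict  # full argument schema, loaded only on request


class ToolRegistry:
    def __init__(self, tools):
        self._tools = {t.name: t for t in tools}

    def search(self, query, limit=3):
        """Return only names and summaries of tools matching the query."""
        words = query.lower().split()
        scored = []
        for t in self._tools.values():
            text = f"{t.name} {t.summary}".lower()
            score = sum(w in text for w in words)
            if score:
                scored.append((-score, t.name, t.summary))
        return [(name, summary) for _, name, summary in sorted(scored)[:limit]]

    def describe(self, name):
        """Load the full schema for a single tool the agent has chosen."""
        return self._tools[name].schema


registry = ToolRegistry([
    Tool("train-status", "show status of a training job", {"job_id": "str"}),
    Tool("feature-search", "search features by name", {"query": "str"}),
    Tool("capacity-plan", "estimate serving capacity", {"model": "str", "qps": "int"}),
])
hits = registry.search("training job status")
print(hits)
print(registry.describe(hits[0][0]))
```

In this sketch the context cost of discovery grows with the number of matching summaries rather than with the total number of registered tools, which is the property the bullet describes.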

We share architectural patterns and operational lessons for making legacy ML infrastructure agent-callable—including progressive tool disclosure, federated domain-team ownership, and production-grounded eval—that transfer to any organization operating ML systems at scale.

Scaling Batch Graph Processing to Trillions of Edges

Authors: Sahil Gandhi, Arneish Prateek

Introducing Meta's new batch graph processing engine, built for the efficient execution of iterative graph algorithms on massive datasets (billions of nodes, trillions of edges). This engine supports various use cases at Meta scale, including LLM pre-training, recommendation systems, data deduplication, and ranking.

The engine tackles critical constraints associated with large-scale batch graph jobs—specifically memory blow-up, skewed shards, network overhead, centralized coordination, and gang-scheduling. It achieves this by reducing costs across data storage, graph partitioning, network communication, engine execution, and distributed coordination. Furthermore, its pluggable architecture accommodates diverse distributed graph processing models.

SparkSentry: Autonomous Error Detection, Triage, and Repair for Spark at Scale

Authors: Varun Srinivas

Meta runs millions of Spark jobs daily across a fleet processing hundreds of petabytes. When failures occur, oncall engineers manually sift through Scuba dashboards, classify errors, file tasks, and investigate root causes — a process that takes hours per incident and doesn't scale with fleet growth.

SparkSentry is an autonomous monitoring pipeline that closes the loop from error detection to code fix with no human intervention. Running twice daily, it ingests thousands of job failures from Scuba, clusters them by root cause using a three-stage approach (deterministic regex matching, TF-IDF + DBSCAN for unknown errors, and LLM semantic merging via Llama 3.1-70B), then analyzes severity using rate-based regression detection scaled to fleet size. For high-severity clusters, SparkSentry creates Phabricator tasks with full investigation context — Scuba links, stack traces, affected users, and codebase-specific investigation instructions — then delegates them to AI agents that autonomously locate the bug in the Spark codebase, write a fix with tests, and submit a diff for review.

Key results: SparkSentry processes ~10K errors per run, produces ~20-50 root cause clusters, and has autonomously generated fixes for error classes affecting thousands of jobs. A multi-layer dedup system (exact cluster ID matching, flexible title matching, normalized title grouping with diff migration) prevents duplicate task creation across runs — reducing oncall noise by eliminating the 7-10 duplicate tasks per error class that manual monitoring produced. The pipeline combines deterministic classification (36+ regex patterns), unsupervised ML clustering, and LLM-powered semantic analysis, with each layer handling failures the previous layer misses.
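The layered classification idea, where deterministic rules run first and later layers handle what they miss, can be sketched in Python as follows. Only the first layer is modeled on the description above; the fallback here is a simple normalized-message grouping that stands in for the TF-IDF + DBSCAN and LLM stages. The two regex patterns and the cluster names are invented for illustration, since the abstract does not list the actual patterns.

```python
import re
from collections import defaultdict

# Hypothetical patterns; the abstract mentions 36+ regexes but does not list them.
KNOWN_PATTERNS = [
    ("oom", re.compile(r"OutOfMemoryError|Container killed .* exceeding memory")),
    ("fetch_failed", re.compile(r"FetchFailedException")),
]


def normalize(message):
    """Replace volatile tokens so similar errors share one title."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)
    message = re.sub(r"\d+", "<N>", message)
    return re.sub(r"\s+", " ", message).strip()


def cluster(messages):
    """Layer 1: regex match. Unmatched errors fall through to normalized grouping."""
    clusters = defaultdict(list)
    for msg in messages:
        for cluster_id, pattern in KNOWN_PATTERNS:
            if pattern.search(msg):
                clusters[cluster_id].append(msg)
                break
        else:  # no known pattern matched
            clusters["unknown:" + normalize(msg)].append(msg)
    return dict(clusters)


errors = [
    "java.lang.OutOfMemoryError: Java heap space",
    "FetchFailedException: Failed to connect to host17:7337",
    "Task 12 in stage 3 failed 4 times",
    "Task 98 in stage 7 failed 4 times",
]
for cluster_id, members in cluster(errors).items():
    print(cluster_id, len(members))
```

Using a stable cluster key, whether a pattern name or a normalized title, is also what makes deduplication across runs possible: a second run that produces the same key can attach to the existing task instead of filing a new one.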

SparkSentry demonstrates a production-grade pattern for agentic infrastructure monitoring: detect → classify → triage → fix → review, operating continuously without human initiation while maintaining human oversight at the review stage.

The Agent Feed: A Communication Layer for Multi-Agent Collaboration

Authors: Chinmay Gandhi, Vlad Tsvang, Patrick Walsh

When multiple AI agents work on the same problem, their outputs are typically siloed—each producing findings independently with no way to build on each other's work, and no way for humans to separate agent signals from their own discussion.

We built the Agent Feed: a shared communication layer where agents publish findings, read each other's outputs, and collaborate through a standardized API. This enables a consensus-driven investigation model—specialist agents contribute domain-specific evidence, a coordinating agent ranks competing hypotheses against independent data sources, and confidence strengthens as multiple signals corroborate. Humans provide lightweight feedback (thumbs up/down) that closes the loop, shaping which findings get promoted and which agents improve over time.

To scale openly without degrading quality, a tiered channel system ranks agent output by trust level (Core > Validated > Experimental), with self-serve onboarding and quality-gated graduation. Responders always see the highest-confidence findings first.
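As a rough sketch of how tiered ranking could order a feed, the code below sorts findings by trust tier and then by confidence within a tier. The record fields, the `order_for_responders` function, and the tie-breaking rule are hypothetical; the abstract gives only the tier order Core > Validated > Experimental.

```python
# Lower rank sorts first; the tier order follows the abstract.
TIER_RANK = {"core": 0, "validated": 1, "experimental": 2}


def order_for_responders(findings):
    """Sort by trust tier, then by descending confidence within each tier."""
    return sorted(
        findings,
        key=lambda f: (TIER_RANK[f["tier"]], -f["confidence"]),
    )
```

Sorting on tier before confidence means an unproven agent cannot push itself to the top of the feed simply by reporting a high score.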

The pattern generalizes beyond incident response: wherever multiple agents operate on a shared artifact, you need a structured communication layer, a consensus mechanism, and a trust framework.

XStream: Meta's unified stream processing service

Authors: Stella Kaval, Zheng Zhang, Shuo Xu, Jun Fan

Real-time stream processing has become critical infrastructure at Meta, powering a wide range of products and services including Ads, Reels, Search, Meta AI, and Graph Learning. Over the past decade, Meta developed multiple in-house stream processing systems to support fast-growing business needs. While this approach enabled rapid iteration, it increasingly confused developers, raised the maintenance burden, and reduced engineering efficiency. To address these challenges, we built XStream, a unified, fully managed stream processing service that consolidates all existing solutions onto a single platform.

XStream has been in production for about six years. It is designed around the principles of ease of authoring, language unification, operational simplicity, high performance, data freshness, and reliability. XStream supports a wide range of applications, including real-time analytics, analysis of distributed traces, and machine learning feature generation. Today, XStream processes data at O(10) TB/s throughput across 70,000+ pipelines running on O(100)k hosts.

This paper presents XStream's architecture and the key design choices behind it. We describe XStream's multi-language authoring layer (SQL, Python, and C++), its vectorized C++ execution engine built on Velox, and its service management platform, Turbine. We deep-dive into the solutions to core technical challenges: exactly-once processing, efficient large-scale shuffling via Scribe-based shuffle, scalable state management with both ZippyDB and the new RocksDB+Manifold state store, and the adoption of the Kappa architecture and Event-level Aggregation Infrastructure (EAI) for improving machine learning feature freshness. We also describe how XStream has evolved over time, including the introduction of the Unified Trigger Framework (UTF) for stateful processing and CodeX for decoupling user logic from engine execution. Finally, we share our production experience scaling XStream to support tens of thousands of pipelines with high availability, reliability, and efficiency, and discuss lessons learned along the way.
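For readers new to exactly-once processing, the sketch below shows one generic way to get exactly-once state updates: commit the operator state and the input offset together, so a restart replays only events whose effects were never committed. This is a textbook pattern under simplifying assumptions, not XStream's mechanism; `CheckpointStore`, `run`, and the `crash_at` parameter are hypothetical names, and an in-memory assignment stands in for an atomic write to a durable store.

```python
class CheckpointStore:
    """Hypothetical store that commits state and input offset together."""

    def __init__(self):
        self.snapshot = {"offset": 0, "counts": {}}

    def commit(self, offset, counts):
        # One assignment stands in for an atomic, durable write.
        self.snapshot = {"offset": offset, "counts": dict(counts)}

    def load(self):
        return self.snapshot["offset"], dict(self.snapshot["counts"])


def run(events, store, checkpoint_every=2, crash_at=None):
    """Count events per key, resuming from the last committed checkpoint."""
    offset, counts = store.load()
    for i in range(offset, len(events)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        counts[events[i]] = counts.get(events[i], 0) + 1
        if (i + 1) % checkpoint_every == 0:
            store.commit(i + 1, counts)
    store.commit(len(events), counts)
    return counts
```

After a crash, uncommitted in-memory updates are discarded and the events behind them are replayed, so each event affects the committed state exactly once even though it may be processed more than once.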

SPEAKERS AND MODERATORS

Faisal Siddiqi leads Engineering in AI and Data Infrastructure at Meta, with a focus...

Faisal Siddiqi

Meta

Barak Yagour is a Vice President of Engineering at Meta, leading the AI and...

Barak Yagour

Meta

Head of Claude Code at Anthropic.

Boris Cherny

Anthropic

Jesse Chen is the Director of Product leading the AI for Developer Productivity efforts...

Jesse Chen

Meta

Komal Mangtani is a seasoned technology executive with 28 years of experience building and...

Komal Mangtani

Meta

Anoop Deoras runs the AI/ML organization for AWS’ Agentic AI business unit. His team...

Anoop Deoras

AWS

Ilia Shumailov holds a PhD in Computer Science from the University of Cambridge. Previously,...

Ilia Shumailov

Meta

Henry Erskine Crum is Vice President of Product Management for AI for Work at...

Henry Erskine Crum

Meta

Joe Spisak is the VP of Product & Head of Open Source at Reflection...

Joe Spisak

Reflection AI

Jessica is a software engineer at Meta and the creator of Claw Town, an...

Jessica Fu

Meta

Xing is a Senior Director of Research at Databricks and currently leads the Databricks...

Xing Chen

Databricks

Matt Schlicht is the creator of Moltbook, the social network built exclusively for AI...

Matt Schlicht

Meta

Qi Guo is a Technical Director and Principal Engineer at Meta, working on the...

Qi Guo

Meta

Alex Ratner is the co-founder and CEO at Snorkel AI, and an affiliate assistant...

Alex Ratner

Snorkel AI

Sarang Masti Sreeshylan is a Software Engineer at Meta, where he works on ZippyDB...

Sarang Masti

Meta

Weiran leads the Stream Processing team at Meta powering real-time data applications in a...

Weiran Liu

Meta

Sabastian Mugazambi is a Group Product Manager for Cloud AI Infrastructure at Google, where...

Sabastian Mugazambi

Google

Dinkar Pataballa is an Engineering Director at Meta, where he leads Data Experiences &...

Dinkar Pataballa

Meta

LATEST NOTES

Systems & Reliability @Scale

06/23/2026

Data Governance in the World of Agents

Komal Mangtani, Can Lin, Projjal Ghosh, and Iuliu Rus A year ago the question about enterprise AI was, “Can we...

UPCOMING EVENT | Systems and Networking

Networking 2026

August 25, 2026 Santa Clara Convention Center, Santa Clara, CA In 2026, @Scale: Networking will continue to focus on the evolution of AI Networking. To address the growing complexity of network operations, we will examine...

UPCOMING EVENT | Mobile, Video and Web

Product 2026

October 28, 2026 Meta Campus, Menlo Park, CA @Scale: Product is an exciting evolution of the @Scale conference series, uniting the best of Product, RTC, Mobile, and Video under a single AI-native theme. We are...

PAST EVENT 06/25/2026 | Systems and Networking

Systems & Reliability 2026

June 25, 2026 Meydenbauer Center, Bellevue, Washington Building the advanced infrastructure necessary to power today's sophisticated AI models represents a monumental engineering challenge. This endeavor demands the creation of highly scalable, high-performance, and supremely reliable...