@Scale: AI & DATA

June 25, 2025

Location: Santa Clara Convention Center
5001 Great America Pkwy, Santa Clara, CA 95054

Meta’s Engineering and Infrastructure teams are excited to bring together a global contingent of engineers who are interested in building, operating, and using AI and data systems at scale.

This year, we will focus on building a world in which Agents interact with billions of users, a critical step towards unlocking the full potential of AI and data systems. Our in-person talks and panels will delve into the latest advancements in agent development, deployment, and product integration, featuring expert insights on topics such as data for agents, agent tools & environments, safety, and privacy. Attendees can expect to gain practical knowledge and strategies for building AI-powered products, as well as a deeper understanding of the evolving ecosystem and its implications for traditional BI and product analytics.

In addition to our in-person talks and panels, our poster session will showcase a wide range of topics relevant to product analytics, including exploring market opportunities, understanding trends, making better decisions, and ensuring that products and systems run robustly and reliably at scale. The conference will foster open discussion and collaboration across the industry, and highlight open source solutions that can be leveraged as a foundation for others to build on.

Register to join us in person and be entered for a chance to win a pair of Ray-Ban Meta Wayfarers!


EVENT AGENDA

Event times below are displayed in PT.

June 25

08:30 AM - 10:00 AM
Attendee Registration
08:30 AM - 10:00 AM
Breakfast, Raffle Submissions, Networking
10:00 AM - 10:10 AM
Opening Remarks
Speaker Aparna Ramani, Meta
10:10 AM - 10:35 AM
Fireside Chat
Speaker Aparna Ramani, Meta
Speaker Ion Stoica, AnyScale
10:35 AM - 10:55 AM
Building Agents - Meta AI

Presentation information coming soon!

Speaker Hussein Mehanna, Meta
10:55 AM - 11:15 AM
Beyond RAG: Production-Ready AI Agents Powered by Enterprise-Scale Data

At Snowflake, we’ve been exploring how to use agents and AI to let any business user talk to their data. Enterprise data isn’t always tidy, and AI agents need more than great retrieval to drive real value. In this session, we’ll share what we’ve learned about enabling agents that deeply understand and reason over structured business data. We’ll cover challenges like navigating messy schemas, generating trustworthy SQL, ensuring consistency in definitions like “revenue,” and making the agent’s process visible to non-technical users. Whether you’re scaling agent use across departments or starting to integrate them into core business workflows, you’ll leave with strategies to make agents effective, reliable, and trusted partners in the enterprise.

Speaker Jeff Hollan, Snowflake
11:15 AM - 11:35 AM
Agentic Evals
Speaker Shishir Patil, Meta
11:35 AM - 11:55 AM
Silent Errors in Large-Scale LLM Training: Challenges and Lessons Learned

GPU cluster reliability is a growing challenge as AI models and the clusters that host them grow to unprecedented scale. Insidious errors such as Silent Data Corruptions (SDCs) are particularly difficult to address due to their highly elusive and non-deterministic nature, and their effect on large-scale LLM training and inference is poorly understood. In this talk, we will present how NVIDIA is leveraging its deep expertise in GPUs and AI to holistically tackle this challenge from silicon to data centers. We will go over the work we are doing to improve our understanding of these complex errors and their effect on real-world, at-scale AI cluster deployments, and the solutions we are developing to help researchers, cluster builders, and the industry protect against SDCs.

Speaker Cyril Meurillon, NVIDIA
Speaker Devin O’Kelly, NVIDIA
11:55 AM - 12:20 PM
Q&A Session
12:20 PM - 01:20 PM
Lunch & Poster Sessions
01:20 PM - 01:50 PM
Live Panel: GenAI Startups

We will explore the world of reinforcement learning, post-training, and what it’s like to build a startup on open models.

Moderator Joe Spisak, Meta
Panelist Horace He, Thinking Machines
Panelist Dhruv Batra, Yutori
Panelist Sal Candido, Evolutionary Scale AI
Panelist Eugen Hotaj, Perplexity
Panelist Carina Hong, Axiom
01:50 PM - 02:15 PM
How to Prepare Your Agents for the Ice(berg) Age

In a future world where agents interact with billions of users, many of these agents will also have to interact with data querying tools to provide answers grounded in facts. As enterprise data analytics is rapidly moving towards open table formats like Apache Iceberg, these agents need to be able to speak to Iceberg-based data. Once agents can speak Iceberg, they gain an advantage: they become portable. They can run in public clouds, locally on a laptop (during development), or on-premises, accessing enterprise data that can't be moved to public clouds. Portability is important because, while NVIDIA GPUs dominate in the cloud, the GPU stack looks different on-premises and on consumer hardware. In this talk, we will discuss how Apache Iceberg tooling and portable application runtimes make agents grounded in facts and enable them to run across different GPU stacks and deployment models.

Speaker Serhii Sokolenko, Tower.dev
02:15 PM - 02:40 PM
Agentic Solution for Data Warehouse Access

Meta manages a large-scale data warehouse where security is a critical component. Every day, teams across Meta are tasked with managing access to the data they oversee and obtaining access to data through internal data products. In this talk, we delve into the challenges of managing internal data access at Meta's scale and its growing complexity. We will also share how we developed an agentic solution to empower both data users and data owners in addressing these challenges.

Speaker Can Lin, Meta
Speaker Uday Ramesh Savagaonkar, Meta
02:40 PM - 03:05 PM
Break & Poster Sessions
03:05 PM - 03:30 PM
Agentic Observability - Making LLM Apps Debuggable, Trustworthy, and Scalable

As LLM applications evolve into multi-agent systems and power complex decision-making workflows, the ability to observe and debug their behavior becomes a core engineering challenge. These systems are dynamic, non-deterministic, and increasingly reliant on external tools and APIs, making traditional monitoring approaches insufficient. At Fiddler, we've worked with enterprise and federal teams deploying LLMs at scale, and what we’ve consistently seen is that the absence of effective observability creates blind spots that delay iteration and introduce risk. In this talk, we will introduce Agentic Observability, a set of techniques and infrastructure to monitor production LLM systems. We will walk through how we trace agent reasoning and tool usage in structured form, apply Fast Trust Models to evaluate output quality beyond token-level accuracy, and monitor shifts in behavior using statistical and embedding-based methods. We will also share how we enable integration testing for agent workflows by simulating decision paths and validating semantic intent, all while operating under the scale and latency constraints of modern AI stacks. This work bridges AI science, platform engineering, and real-world GenAI deployment. We will highlight engineering lessons learned from high-scale environments, and how these observability tools are helping teams move faster, catch failures earlier, and build AI systems that can be trusted in production.

Speaker Krishna Gade, Fiddler
03:30 PM - 03:55 PM
Bringing AI Into the Real World

Join us as we delve into the high-level design and architecture that enabled the creation of Ray-Ban Meta, a best-in-class AI wearable used by millions. To make AI truly useful, it must be seamlessly integrated into our daily lives, providing reliable and high-performance capabilities. However, achieving this requires overcoming real-world physical limitations through innovative engineering and model design.

In this talk, we'll explore how a user-centric approach drove the development of a complex architecture that harmoniously brings together multiple components to meet the unique needs of our users. From running models directly on the frames to optimizations informed by user behavior, we'll break down the key elements that have made Ray-Ban Meta a game-changer in the world of AI wearables.

Speaker Alexandru Petrescu, Meta
03:55 PM - 04:25 PM
Live Panel: "Infrastructure in an Agentic World"

If the future is agentic, what does this mean for Infrastructure?

Moderator Karthik Lakshminarayanan, Meta
Panelist Barak Yagour, Meta
Panelist Anna Berenberg, Google
Panelist Barr Moses, Monte Carlo
Panelist Qi Ke, Microsoft
04:25 PM - 04:30 PM
Closing Remarks
Speaker Barak Yagour, Meta
04:30 PM - 06:00 PM
Happy Hour & Poster Sessions

SPEAKERS AND MODERATORS

Aparna is VP Engineering at Meta, responsible for AI Infrastructure, Data Infrastructure and Developer...

Aparna Ramani

Meta

Ion Stoica is a Professor in the EECS Department at the University of California...

Ion Stoica

AnyScale

I’m a technology executive with a deep background in building AI systems—from physical autonomy...

Hussein Mehanna

Meta

Jeff is the Director of Product for AI Agents and Applications at Snowflake. His...

Jeff Hollan

Snowflake

Shishir is a Research Scientist on the Llama post-training team, where he led the...

Shishir Patil

Meta

Cyril Meurillon is a software engineer at NVIDIA, where he covers resiliency. His work...

Cyril Meurillon

NVIDIA

Devin O'Kelly is a Senior HPC Engineer at NVIDIA where he focuses on fleet...

Devin O’Kelly

NVIDIA

Joe Spisak is Product Director and Head of Open Source in Meta’s Generative AI...

Joe Spisak

Meta

Bio: Horace is interested in making both researchers and GPUs happy. He currently works...

Horace He

Thinking Machines

Dhruv Batra is a co-founder and the Chief Scientist of Yutori. Previously, he was...

Dhruv Batra

Yutori

Co-founder and CTO of Evolutionary Scale AI, former UTL at Meta and Google X.

Sal Candido

Evolutionary Scale AI

Member of technical staff at Perplexity, formerly Llama post-training at Meta.

Eugen Hotaj

Perplexity

Founder & CEO of Axiom, former Stanford PhD.

Carina Hong

Axiom

Serhii Sokolenko is the CEO and co-founder of Tower.dev, a hassle-free platform for data...

Serhii Sokolenko

Tower.dev

Can Lin is a software engineer in the AI & Data Infrastructure Responsibility area...

Can Lin

Meta

Uday Ramesh Savagaonkar

Meta

Krishna Gade is the Founder/CEO of Fiddler.AI, a Model Performance Monitoring startup. Prior to...

Krishna Gade

Fiddler

Alexandru Petrescu has been a Software Engineer at Meta for the past 12 years,...

Alexandru Petrescu

Meta

Karthik Lakshminarayanan is a Product Management Director at Meta.

Karthik Lakshminarayanan

Meta

Barak Yagour is the Vice President of Engineering at Meta, leading the Data Infrastructure...

Barak Yagour

Meta

Anna Berenberg is an Engineering Fellow and Uber Tech Lead for GCP as Platform....

Anna Berenberg

Google

Barr Moses is CEO & Co-Founder of Monte Carlo, a data and AI observability...

Barr Moses

Monte Carlo

Corporate Vice President, Cloud + AI. Qi Ke leads several key initiatives within Microsoft...

Qi Ke

Microsoft

2025 Events

@Scale is a technical conference series for engineers who build or maintain systems designed for scale. New this year, in-person and virtual attendance options will be available at all four of our programs, which bring together complementary themes to build event communities and spark cross-discipline collaboration.

AI & DATA - JUNE 25, 2025

Register today and learn how you can win a pair of Ray-Ban Meta Wayfarers!

NETWORKING - AUGUST 13, 2025

Hosted In Person & Virtually
Santa Clara Convention Center

In 2025, @Scale: Networking will continue to focus on the evolution of AI Networking. To address the growing complexity of network operations, we will examine a full-stack perspective on debugging, encompassing the communications layer through to the hardware. By adopting a holistic approach across both our front-end and back-end networks, we can identify and mitigate potential bottlenecks, ensuring optimal network performance. You will also hear from industry experts and leading researchers who are at the forefront of building large-scale networks. Attendees will benefit from the opportunity to learn about diverse approaches to solving common challenges and explore potential collaborations.

Register today and stay tuned for upcoming agenda announcements.

PRODUCT - OCTOBER 22, 2025

Hosted In Person & Virtually
Meta Campus, Menlo Park

@Scale: Product is an exciting evolution of the conference series, bringing together the best of Product @Scale, RTC @Scale, Mobile @Scale, and Video @Scale. This comprehensive program is designed for engineers who are passionate about building and optimizing large-scale products. Attendees will gain insights into the latest innovations, best practices, and tools that drive efficiency and performance across product development, real-time communication, mobile platforms, and video technologies.

Register today and stay tuned for upcoming agenda announcements.

SYSTEMS & RELIABILITY - PAST EVENT

Hosted In Person & Virtually
Meta Campus, Menlo Park

The first installment of the 2025 @Scale conference series will combine two of the most foundational topics across the stack, Systems & Reliability. This two-track program will feature technical talks about the demands of AI and the conference theme of "rising to the challenge." The themed talks will include compelling stories about solving the hardest hyper-scale problems with distributed systems, infra resilience and many more complex challenges by speakers from around the industry.
