Reliability @Scale 2024

October 9, 2024

In the digital age, where systems operate at unprecedented scales, the importance of robust configuration management cannot be overstated. This year’s Reliability @Scale will focus on a central theme of "Move Safely", emphasizing the critical role of configuration and code safety in achieving and maintaining system reliability at scale.

This event will be hosted virtually on October 9th. Joining us will be speakers from Amazon, Bloomberg, Fauna, Honeycomb, Meta and Microsoft, who will share innovative strategies, best practices, and cutting-edge tools and processes designed to enhance configuration safety and reliability in large-scale systems.


EVENT AGENDA

Event times below are displayed in PT.

October 9th

09:00 AM - 09:05 AM
Opening Remarks
Speaker Michael Chang,Meta
09:05 AM - 09:45 AM
A Decade of Progress: The Evolution of Production Engineering and Reliability

Join us for a fascinating fireside chat as we explore the evolution of Production Engineering and Reliability over the past decade. Our panelists, including the current and previous VPs of Production Engineering and the Head of Reliability Infrastructure, will share their insights on how the role of Production Engineering has adapted to meet the challenges of scale, complexity, and emerging technologies.
From the early days of ensuring reliability to the present focus on configuration safety and automated checks, our panelists will discuss the key lessons learned and how the role of PEs has helped drive the success Meta has seen over the past decade. They'll also delve into the importance of balancing speed and safety in deployment processes, and how this balance is crucial for maintaining reliability at scale.
As we look to the future, our panelists will share their thoughts on the emerging trends and technologies that will require new approaches to Reliability and Production Engineering, and how we can better collaborate across teams and organizations to improve Reliability.
Join us for an engaging and informative discussion on the evolution of Production Engineering and Reliability at Meta!

Speaker Francois Richard,Meta
Speaker Pedro Canahuati,1Password
Speaker Peter Hoose,Meta
09:45 AM - 10:05 AM
Dependency Safety for Distributed Systems

The end-to-end journey of every user request originating from our apps traverses hundreds of hops across Meta’s distributed architecture, including client libraries, privacy and security frameworks, microservices, data stores, and hardware. These dependencies are interconnected, making our infra a massive, tightly coupled graph. We aim to minimize the impact and duration of dependency outages by making dependency health a core part of reliability at Meta. Dependency safety will enable us to protect the business, improve the user experience on our apps, and enhance incident response.

Speaker Ankita Vimal,Meta
Speaker Antonio De La Vega,Meta
Featured Article
Untangling Dependencies @ Scale
10:05 AM - 10:25 AM
Improving Cloud Reliability at Scale Using Gen AI

Building and operating reliable hyper-scale cloud services requires a significant amount of domain knowledge and human effort. Generative AI has proven effective in specialized domains, including software engineering tasks like code authoring. However, leveraging vanilla LLMs for specialized tasks like incident management is not feasible due to their lack of domain knowledge and relevant context. In this talk, I will present our research and findings from designing and deploying a multi-tiered framework using LLMs for end-to-end diagnosis of production incidents across Microsoft. I will also present AIOpsLab, our framework for developing and evaluating Cloud Ops agents in a principled manner, with the goal of improving the resiliency of cloud services.

Speaker Chetan Bansal,Microsoft
10:25 AM - 10:45 AM
Reliability, Innovation, Time to Market: Choose All Three

On a daily basis, engineers at hyper-scale companies build cutting-edge products that are quickly shipped to massive user bases. Does operating this way always require compromises in reliability? In this talk, Justin will discuss how lessons he learned from his experience working in aviation — where reliability is not optional — are applied at Meta to achieve high reliability along with innovation and short time to market.

Speaker Justin Gibbs,Meta
Featured Article
Reliability, Innovation, Time to Market: Choose All Three
10:45 AM - 11:05 AM
9 SLIs; Oh My!

After years of working with and coaching teams to implement SLOs, it has become clear to me that the greatest challenge engineering and product teams face is finding the right SLIs. SLOs are hard to get right, and it generally takes time and multiple iterations to tweak, tune, and adjust them so that they tell us when we need to take action to defend the reliability of our systems. Collectively, all our teams want to release when it’s safe to do so and to know when it’s necessary to roll back or restore data. However, there is an underlying assumption that the SLI itself is, and has been, providing value.

As hard as SLOs are to get right, thinking of a good SLI is also difficult. This especially complicates things for engineering teams that don’t have a product person. As a result, they often struggle to identify their key user and customer journeys. This talk will attempt to give attendees additional guidance to help them think more clearly about SLIs and create better ones.

Speaker Sal Furino,Bloomberg
11:05 AM - 11:30 AM
Live Q&A Session #1
Moderator Joe Gasperetti,Meta
Speaker Ankita Vimal,Meta
Speaker Antonio De La Vega,Meta
Speaker Chetan Bansal,Microsoft
Speaker Justin Gibbs,Meta
Speaker Sal Furino,Bloomberg
11:30 AM - 11:35 AM
Break
11:35 AM - 11:55 AM
Making Deployments Safe at Meta

Outages at Meta prevent billions of people around the world from communicating with each other. We are constantly striving to improve the reliability of our products and systems to ensure they are functioning as expected. We’ll dive into the critical role of deployment health checks in enhancing the reliability of thousands of systems. We’ll share strategies around keeping a high bar for change safety while minimizing noise and raising trust in the deployment process. Gain insights into our vision and ongoing efforts to bolster infrastructure reliability.

Speaker Christopher Hegre,Meta
Speaker Anton Korenkov,Meta
Featured Article
Making Deployments Safe at Meta
11:55 AM - 12:15 PM
Applying Continuous Delivery to Database Schema: Reduce Risk & Accelerate by Pipelining All The Things

While Continuous Delivery (CD) has revolutionized application deployment, database schema changes often remain a manual, high-risk process. This talk explores how to extend CD practices to schema management in databases, reducing risk and accelerating delivery. Tyson Trautmann, VP of Engineering at Fauna, illustrates the challenges of traditional schema change processes and presents strategies for implementing Continuous Schema Delivery.

Attendees will learn about the unique challenges that data and associated schema present for CD, essential requirements for successful implementation, and practical techniques for integrating schema changes into CI/CD pipelines. The talk covers version control for schemas, zero-downtime migration techniques, and automated testing strategies. Tyson also demonstrates how Fauna's approach to schema management, which includes progressive schema enforcement, schema-as-code capabilities, and zero-downtime migrations, supports implementing CD best practices as the database schema evolves.

This talk provides valuable insights for engineers looking to reduce risk and increase delivery speed by bringing their database schema into their CD workflow. Learn how to pipeline all the things — including your schema changes — for a more robust and efficient development process.

Speaker Tyson Trautmann,Fauna
12:15 PM - 12:35 PM
Limiting the Blast Radius: Preventing outages from bad configs with datacenter-scale health checks

Configuration changes can easily be catastrophic, with the potential to create broad, instantaneous system outages. We use datacenter-scale health metrics to validate configuration changes before they deploy to all of production. By adopting this validation step broadly at Meta, we have been able to prevent several major incidents. In addition, we use the signature generated by single-datacenter deployments to quickly identify the root cause of many other incidents. This talk will delve into the technique of region-scale health checks, the successes achieved, and our ongoing work in the space to prevent future incidents.

Speaker Zach Zundel,Meta
Speaker Joe Romano,Meta
Featured Article
Limiting the Blast Radius: Preventing Site Outages with Data Center-Scale Health Checks
12:35 PM - 01:00 PM
Live Q&A Session #2
Moderator Joe Gasperetti,Meta
Speaker Christopher Hegre,Meta
Speaker Anton Korenkov,Meta
Speaker Tyson Trautmann,Fauna
Speaker Zach Zundel,Meta
Speaker Joe Romano,Meta
01:00 PM - 01:05 PM
Break
01:05 PM - 01:25 PM
Building & Running a Scuba-Inspired Observability Tool

Charity, Christine, Ben, and Ian began using Scuba at Facebook in 2012 in order to diagnose complex problems with the multi-tenant systems of the Parse acquisition. The columnar, in-memory data store, despite being fronted by a user-hostile UI, was lightning-fast and completely unlike the traditional log analytics or metrics TSDB systems that they'd used before. Upon leaving in 2016, they created Honeycomb to enable teams at non-FAANG companies to benefit from the modern approach to analytics and observability they'd seen at Facebook. In this talk, you'll learn about how Honeycomb's columnar datastore, named Retriever, uses commodity blob storage and serverless functions to achieve the same kind of fast iteration speed, and is coupled with an intuitive user interface.

Speaker Liz Fong-Jones,Honeycomb
Speaker Ben Hartshorne,Honeycomb
01:25 PM - 01:45 PM
Safe Deployment: Exploring Methodologies and Tools That Ensure Fast, Safe and Error-Free Code and Configuration Changes

In our fast-paced network environment, velocity and safety are often at odds with each other. However, the devastating consequences of the access_denied SEV0 highlighted the need for a new approach to change management. In response, Netinfra has adopted a revolutionary Safe Deployment strategy that prioritizes both speed and reliability. Join us as we explore the evolution of change safety management for 300+ network devices in Meta datacenters and delve into the methodologies and tools that underpin Safe Deployment. We'll discuss how automated testing, network simulation, and controlled rollouts come together to minimize the risk of network disruptions and downtime. We'll also examine the benefits and technical challenges of Netinfra's centralized safety service, which streamlines safety checks and ensures consistency across all network operations and changes. Take away valuable insights on balancing velocity and safety in your network environment and learn from Netinfra's experiences in creating a more reliable and secure network infrastructure.

Speaker Abhinav Sharma,Meta
Speaker Francisco Bautista,Meta
Featured Article
Safe Change Management in Meta Data Centers
01:45 PM - 02:05 PM
Resilience Is Not Enough

Resilience is important, but it's not enough. Even the most robust systems may face failures and outages at some point. In this talk, Joe will explore the critical importance of building recoverable systems - ones that don't just withstand disruptions, but can be recovered quickly and predictably, even in the face of the most complex failures.

Speaker Joe Magerramov,Amazon
02:05 PM - 02:25 PM
Scaling Releases: Inside Meta's WWW Release Operations

In this talk, we will explore the key components that make Meta's WWW release process sustainable, effective, and robust in the face of rapid growth. Meta's competitive advantage lies in its reliable and frequent releases, a process that has been continuous since 2017. This process has not only stood the test of time but has also evolved to meet the demands of an expanding organization.

Speaker Vladimirs Kotovs,Meta
Speaker Casey McGinty,Meta
Featured Article
Scaling Releases: Inside Meta WWW Release Operations
02:25 PM - 02:50 PM
Live Q&A Session #3
Moderator Joe Gasperetti,Meta
Speaker Liz Fong-Jones,Honeycomb
Speaker Ben Hartshorne,Honeycomb
Speaker Abhinav Sharma,Meta
Speaker Francisco Bautista,Meta
Speaker Joe Magerramov,Amazon
Speaker Casey McGinty,Meta
Speaker Vladimirs Kotovs,Meta
02:50 PM - 02:55 PM
Closing Remarks
Speaker Michael Chang,Meta

SPEAKERS AND MODERATORS

Michael Chang works on Fault Tolerance within Meta Infrastructure. For the last four years,...

Michael Chang

Meta

Francois Richard is Engineering Director responsible for the Reliability Infra at Meta. Reliability Infra...

Francois Richard

Meta

Pedro Canahuati is the chief technology officer (CTO) of 1Password. Prior to 1Password, Pedro...

Pedro Canahuati

1Password

Peter Hoose is the head of Production Engineering at Meta. PE is a unique...

Peter Hoose

Meta

Ankita Vimal is a software engineer in the Reliability Infra org at Meta. She...

Ankita Vimal

Meta

Antonio has been a production engineer at Meta for the past 5 and a...

Antonio De La Vega

Meta

Chetan Bansal is a Senior Principal Research Manager at Microsoft. He works on building...

Chetan Bansal

Microsoft

Justin has spent the last 9 years working on distributed infrastructure systems at Meta....

Justin Gibbs

Meta

Sal Furino is a Customer Reliability Engineer at Bloomberg. During his career he’s worked...

Sal Furino

Bloomberg

Joe is currently a production engineer on Meta’s Reliability Engineering initiative. Over the last...

Joe Gasperetti

Meta

Christopher Hegre joined Meta in 2017 as a front end software engineer. Before joining...

Christopher Hegre

Meta

Anton is a software engineer working on Monitoring & Observability at Meta. Currently, he...

Anton Korenkov

Meta

Tyson Trautmann, Fauna's VP of Engineering, is a seasoned technology leader with a passion...

Tyson Trautmann

Fauna

Zach is a Production Engineer at Meta who works on platform reliability and incident...

Zach Zundel

Meta

Joe Romano is a software engineer working on deployment products for code, config, and...

Joe Romano

Meta

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE)...

Liz Fong-Jones

Honeycomb

Ben Hartshorne’s journey has been from racking servers and aggregating metrics in RRDs to...

Ben Hartshorne

Honeycomb

Abhinav Sharma is a production engineer with over 5 years of experience at Meta,...

Abhinav Sharma

Meta

Francisco is currently a Production Engineer at Meta where he works with the team...

Francisco Bautista

Meta

Joe is a Distinguished Engineer at AWS. He is a builder, who enjoys building...

Joe Magerramov

Amazon

Vladimirs is a passionate software engineer with broad experience in areas ranging from middleware...

Vladimirs Kotovs

Meta

Casey McGinty has been a Senior Software Engineer on the Release Engineering team at...

Casey McGinty

Meta

LATEST NOTES

Reliability @Scale
10/09/2024
Making Deployments Safe At Meta
At Meta, ensuring the reliability of products and services is a top priority. Achieving our company mission of giving people...
Reliability @Scale
10/09/2024
Untangling Dependencies @ Scale
Meta turned 20 this year. And in this time, the company has grown from one product to multiple user-facing apps,...
Reliability @Scale
10/09/2024
Safe Change Management in Meta Data Centers
In this blog post, we will explore the evolution of change-safety management for the hundreds of thousands of network devices...
Reliability @Scale
10/09/2024
Reliability, Innovation, Time to Market: Choose All Three
Reliability @Scale
10/09/2024
Limiting the Blast Radius: Preventing Site Outages with Data Center-Scale Health Checks
Motivation/History: Configerator is a system that rapidly distributes service configuration files across the server fleet. Changes are distributed to all...
Reliability @Scale
10/31/2024
Scaling Releases: Inside Meta WWW Release Operations
By Vladimirs Kotovs and Casey McGinty. As Meta continues to grow and evolve, one...
