TOPIC: Data, Systems and Networking

Reliability @Scale 2023

SEPTEMBER 26, 2023 @ 10:00 AM PDT - 3:00 PM PDT

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems and products at massive scale. From novel design decisions to outages that impact billions of people, providing reliable experiences at this scale presents unique technical challenges.

Register today and check back for upcoming speaker and agenda announcements!

RSVPS CLOSED

ABOUT EVENT

Reliability @Scale 2023 will be hosted virtually. Joining us are speakers from AWS, Google, Meta, Microsoft and Netflix. The event will feature talks themed around reactive and proactive reliability strategies for incident management, reliability monitoring, and building strong reliability cultures.

EVENT AGENDA

Event times below are displayed in PT.

September 26

10:00 AM - 10:30 AM
Featured Panel: Reliability A to Z

To truly build resilient systems and products, reliability needs to be incorporated at each layer of the stack. This panel will discuss how to incorporate a “reliability mindset” across all layers of the stack, how this mindset shift has helped Meta grow and scale new products and AI, and how to build a strong culture of reliability within an organization.

Moderator Francois Richard, Meta
Speaker Syamla Bandla, Meta
Speaker Surupa Biswas, Meta
Speaker Chris Malone, Meta
Speaker Jason Kalich, Meta
10:30 AM - 10:50 AM
Evolution of Disaster Recovery @ Meta

Disaster Recovery (DR) is our program to prepare Meta’s infrastructure to handle capacity outages. In this talk, we present the story of how the DR program has evolved, from handling single-region failures to new risks on the horizon, such as power and multi-region outages. We will also discuss DR strategies for readiness and remediation, such as Power Storms and Site Degradation.

Speaker Ahmed Eid, Meta
Speaker Raghavendra Prabhu, Meta
FEATURED BLOG
THE EVOLUTION OF DISASTER RECOVERY AT META
10:50 AM - 11:10 AM
Large Language Models for Automatic Cloud Incident Management

Building reliable hyper-scale cloud services can be challenging. Incidents must be detected, analyzed, and mitigated quickly, and today that work relies largely on human effort. Recent breakthroughs in large language models (LLMs) have motivated us to explore their potential for automated incident diagnosis. By leveraging LLMs, we aim to accelerate incident resolution, leading to improved service reliability and a better customer experience. For the first time, we have demonstrated the effectiveness of LLMs in improving cloud reliability. In this talk, we will share our findings, research innovations, and vision for this space.

Speaker Rujia Wang, Microsoft
11:10 AM - 11:30 AM
Building Resilient Monitoring at Meta

Meta’s monitoring infrastructure is responsible for monitoring the health of thousands of systems deployed on millions of heterogeneous, geographically distributed hosts. Monitoring the health of Meta’s infrastructure is crucial to both our users and our business, and it is especially important during widespread failures. This talk explains the journey of hardening Meta’s monitoring systems to be among our most resilient infrastructure, available when most other systems are degraded (and when we most need monitoring). This session will cover both the technical strategies (e.g., workload isolation, graceful degradation) and the cultural and process strategies (e.g., meetings, tracking) that we leveraged to improve the resiliency of monitoring systems at Meta.

Outline:
Overview of Monitoring @ Meta
When Monitoring Fails: What we learned from the Facebook 2021 outage
Building Culture: Resiliency as a core value
Building Resilient Systems: Know thy enemy
Iterative Improvement: Measuring resilience

Speaker Adam Phillabaum, Meta
Speaker David Pariag, Meta
FEATURED BLOG
BUILDING RESILIENT MONITORING AT META
11:30 AM - 11:50 AM
Migrations at Scale: Learnings & Patterns from Zero-Downtime Migrations

Zero-downtime migrations have become an indispensable part of modern software engineering, enabling organizations to smoothly transition complex systems while ensuring uninterrupted operations. In this session, we will dive into the intricacies of zero-downtime migrations and the "what, why, and how" behind these seamless transitions at scale.

Speaker Kishore Banala, Netflix
Speaker Anoop Panicker, Netflix
11:50 AM - 12:10 PM
Scribe: Improving Reliability One 9 at a Time

Scribe is at the heart of data transport at Meta. Everything from revenue-critical data to important monitoring datasets flows through the system, which sets an extremely high reliability bar. In this talk, we will take you through our journey of adding the fifth nine to our SLAs. We’ll talk about how we detect incidents before our customers do without overloading our oncalls, and how we hardened the system against regional and dependency outages.

Speaker Mohamed Bassem, Meta
Speaker Tiziano Carotti, Meta
Speaker Yuri Dolgov, Meta
FEATURED BLOG
SCRIBE: IMPROVING RELIABILITY ONE NINE AT A TIME
12:10 PM - 12:40 PM
Q&A
Moderator Laine Campbell, Meta
Speaker Ahmed Eid, Meta
Speaker Kishore Banala, Netflix
Speaker Anoop Panicker, Netflix
Speaker Mohamed Bassem, Meta
Speaker Tiziano Carotti, Meta
Speaker Yuri Dolgov, Meta
Speaker Adam Phillabaum, Meta
Speaker David Pariag, Meta
Speaker Raghavendra Prabhu, Meta
12:40 PM - 12:55 PM
Break
12:55 PM - 01:15 PM
Improving Reliability Through Data-Driven Engineering & Culture Changes

Over the past two years, we have leveraged a data-driven strategy to improve the state of system reliability across Monetization. The first step was to quantify the impact of reliability work by defining a longitudinal metric that measures the impact of reliability failures on the business. For the Advertising business at Meta, the negative impact from SEVs (system outages) can be characterized in terms of advertiser value lost (our systems cannot deliver optimal value for advertisers), which translates to short- and long-term revenue lost for the company. Teams across the company then took on clear data-driven goals to reduce the negative impact on advertisers of SEVs affecting their systems. These goals were based on data mining of past SEVs, which helped identify actionable hot spots in terms of the four levers of reliability (prevention, detection, mitigation, culture).

Speaker David Amsallem, Meta
Speaker Gursharan Singh, Meta
FEATURED BLOG
IMPROVING RELIABILITY THROUGH DATA-DRIVEN ENGINEERING AND CULTURE CHANGES: LEARNINGS FROM MONETIZATION
01:15 PM - 01:35 PM
A Cultural Foundation of Operational Excellence: Amazon’s Mechanisms for Resiliency

Operational excellence is hard, and it's impossible without a strong culture. One of Amazon's founding ideals is operational excellence; our store wouldn't work without it, and AWS especially demands deep operational excellence. Learn how Amazon's culture of obsessive, rigorous ownership and our unrelenting focus on building mechanisms and truth-seeking undergird everything we do at AWS, especially handling events and resiliency.

Speaker Peter M. O’Donnell, AWS
01:35 PM - 01:55 PM
Building a Culture of Reliability

Building reliability engineering programs requires not only technological approaches, but also attention to the underlying culture of an organization and how it can help or hinder efforts to enhance reliability. This talk explores how and why cultural values, as part of a broader systems analysis, can help drive reliability. Specifically, this talk presents an anthropologist’s perspective on Meta’s journey from a mantra of “move fast and break things” to one where reliability is increasingly becoming an important value for our engineering community. The audience should take away actionable practices that can be used to evaluate their underlying reliability culture, take data-driven approaches to measuring reliability sentiment, and identify practices that align with existing cultural values and solidify actions to bring reliability to the fore.

Speaker Casey Bouskill, Meta
FEATURED BLOG
WHY IS A CULTURAL ANTHROPOLOGIST ON A RELIABILITY ENGINEERING TEAM?
01:55 PM - 02:15 PM
A Reliability Maturity Model Explained

In this presentation, we focus on the organization and the ethos that contribute to the reliability that the organization is capable of sustaining for its products. We describe the stages of reliability maturity that an engineering organization can traverse, explain how to determine where on the continuum of maturity your organization currently falls, and provide some ideas for improving the organization's reliability maturity. Most importantly, we will provide some thoughts on how to determine the reliability maturity phase the organization requires and what attributes contribute to it. Not all teams need to achieve or maintain the same level of maturity, and knowing what best suits your organization is critically important.

We will also present case studies of some products at Google as they traverse this model to find a reliability equilibrium that fits their organizational and cultural norms.

Speaker Tracy Ferrell, Google
Speaker Vartika Agarwal, Google
02:15 PM - 02:45 PM
Q&A
Moderator Marina Fisher, Meta
Speaker David Amsallem, Meta
Speaker Gursharan Singh, Meta
Speaker Casey Bouskill, Meta
Speaker Tracy Ferrell, Google
Speaker Vartika Agarwal, Google
02:45 PM - 03:00 PM
Closing Remarks

SPEAKERS AND MODERATORS

Francois Richard is Engineering Director responsible for the Reliability Infra at Meta. Reliability Infra...

Francois Richard

Meta

Syamla is currently the Senior Director of Production Engineering for Products at Meta which...

Syamla Bandla

Meta

Surupa Biswas is the Vice President of Engineering responsible for Core Systems at Meta....

Surupa Biswas

Meta

Chris is part of the Infra Data Center organization where he helps define Meta’s...

Chris Malone

Meta

Jason leads the Production Engineering teams globally at Meta as well as the Security...

Jason Kalich

Meta

Ahmed is a software engineer working on teams across Meta’s Infrastructure for the past...

Ahmed Eid

Meta

Raghavendra (Raghu) D. Prabhu is a Software Engineer in the Reliability Infra org at...

Raghavendra Prabhu

Meta

Rujia Wang is a principal research PM at Microsoft, leading the research and product...

Rujia Wang

Microsoft

Adam Phillabaum is the Production Engineering Manager for Meta's Monitoring Products, ensuring our monitoring...

Adam Phillabaum

Meta

David Pariag is a software engineer focusing on Meta’s Monitoring Products. He’s spent the...

David Pariag

Meta

Kishore Banala is a Software Engineer at Netflix, and one of the core contributors...

Kishore Banala

Netflix

Anoop Panicker is a Netflix engineer, specializing in the development of large scale distributed...

Anoop Panicker

Netflix

Mohamed has been a software engineer at Meta for the past 6 years. His...

Mohamed Bassem

Meta

Tiziano is a software engineer with 15 years work experience. He is a versatile...

Tiziano Carotti

Meta

Yuri is one of the lead engineers in the Data Infrastructure team at Meta...

Yuri Dolgov

Meta

Laine is the Production Engineering Director supporting Core Data, Meta's online data stack. She...

Laine Campbell

Meta

David Amsallem is a Data Science Manager at Meta. He has been working in...

David Amsallem

Meta

Gursharan works on Ads Data Infrastructure at Meta, and also leads the Ads Reliability...

Gursharan Singh

Meta

Peter M. O’Donnell is an AWS Principal Solutions Architect, specializing in security, risk, and...

Peter M. O’Donnell

AWS

Bouskill is a senior researcher at Meta where she works on Reliability Engineering. An...

Casey Bouskill

Meta

A 10 year Google SRE, now serves as a Product Area SRE lead, has...

Tracy Ferrell

Google

A senior Technical Program Manager, now serving as lead in Cloud Security, as a...

Vartika Agarwal

Google

Marina is a Production Engineering director in Meta Enterprise Infra and Security group. With...

Marina Fisher

Meta

LATEST NOTES

Reliability @Scale
09/26/2023
Scribe: Improving Reliability One Nine at a Time
Why does Scribe need more nines? Scribe is a highly scalable, distributed, durable queue that allows writing and reading large-volume...
Reliability @Scale
09/26/2023
The Evolution of Disaster Recovery at Meta
Overview In the System @Scale 2019, Justin Meza and Shruti Padmanabha spoke about Disaster Recovery at Facebook Scale, describing varying...
Reliability @Scale
09/26/2023
Building Resilient Monitoring at Meta
Overview of Monitoring at Meta Since its inception, Meta (formerly Facebook) has invested heavily in software and hardware infrastructure. Today,...
Reliability @Scale
09/26/2023
Why is a Cultural Anthropologist on a Reliability Engineering Team?
For many of its formative years, Facebook subscribed to the motto “Move fast and break things.” I did not work...
Reliability @Scale
09/27/2023
Improving Reliability through Data-driven Engineering and Culture Changes: Learnings from Monetization
Advertising at Meta Advertising is the primary revenue-generating product for Meta, producing over $113 billion in revenue in 2022, more...
