Reliability @Scale 2023

SEPTEMBER 26, 2023 @ 10:00 AM PDT - 3:00 PM PDT

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems and products at massive scale. Whether it involves novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges.

ABOUT EVENT

Reliability @Scale 2023 will be hosted virtually. Joining us are speakers from AWS, Google, Meta, Microsoft and Netflix. The event will feature talks themed around reactive and proactive reliability strategies for incident management, reliability monitoring, and building strong reliability cultures.

EVENT AGENDA

Event times below are displayed in PT.

September 26

10:00 AM - 10:30 AM

Featured Panel: Reliability A to Z

To truly build resilient systems and products, reliability needs to be incorporated at each layer of the stack. This panel will discuss how to incorporate a “reliability mindset” across all layers of the stack, how this mindset shift has helped Meta grow and scale new products and AI, and how to build a strong culture of reliability within an organization.

Moderator Francois Richard, Meta

Speaker Syamla Bandla, Meta

Speaker Surupa Biswas, Meta

Speaker Chris Malone, Meta

Speaker Jason Kalich, Meta

10:30 AM - 10:50 AM

Evolution of Disaster Recovery @ Meta

Disaster Recovery (DR) is our program to prepare Meta’s infrastructure to handle capacity outages. In this talk, we present the story of how the DR program has evolved, from handling single-region failures to addressing new risks on the horizon, such as power and multi-region outages. We will also discuss DR strategies for readiness and remediation, such as Power Storms and Site Degradation.

Speaker Ahmed Eid, Meta

Speaker Raghavendra Prabhu, Meta

FEATURED BLOG

THE EVOLUTION OF DISASTER RECOVERY AT META

10:50 AM - 11:10 AM

Large Language Models for Automatic Cloud Incident Management

Building reliable hyper-scale cloud services can be challenging. We need to quickly detect, analyze, and mitigate incidents, work that relies largely on human effort today. Recent breakthroughs in Large Language Models (LLMs) have motivated us to explore their potential for automated incident diagnosis. By leveraging LLMs, we aim to accelerate the incident resolution process, leading to improved service reliability and better customer experience. For the first time, we have demonstrated the effectiveness of LLMs in improving cloud reliability. In this talk, we will share our findings, research innovations, and visions in this space.

Speaker Rujia Wang, Microsoft

11:10 AM - 11:30 AM

Building Resilient Monitoring at Meta

Meta’s monitoring infrastructure is responsible for monitoring the health of thousands of systems deployed on millions of heterogeneous, geographically distributed hosts. Monitoring the health of Meta’s infrastructure is crucial to both our users and our business, and it is especially important during widespread failures. This talk explains the journey of hardening Meta’s monitoring systems to be among our most resilient infrastructure, available when most other systems are degraded (and when we most need monitoring). This session will touch on both the technical strategies (e.g., workload isolation, graceful degradation) and the cultural/process strategies (e.g., meetings, tracking) that we leveraged to improve the resiliency of monitoring systems at Meta. Outline: (1) Overview of Monitoring @ Meta; (2) When Monitoring Fails: What we learned from the Facebook 2021 outage; (3) Building Culture: Resiliency as a core value; (4) Building Resilient Systems: Know thy enemy; (5) Iterative Improvement: Measuring resilience.

Speaker Adam Phillabaum, Meta

Speaker David Pariag, Meta

FEATURED BLOG

BUILDING RESILIENT MONITORING AT META

11:30 AM - 11:50 AM

Migrations at Scale: Learnings & Patterns from Zero-Downtime Migrations

Zero-downtime migrations have become an indispensable part of modern software engineering, enabling organizations to smoothly transition complex systems while ensuring uninterrupted operations. In this session, we will take a deep dive into zero-downtime migrations, covering the "what, why, and how" behind these transitions at scale.

Speaker Kishore Banala, Netflix

Speaker Anoop Panicker, Netflix

11:50 AM - 12:10 PM

Scribe: Improving Reliability One 9 at a Time

Scribe is at the heart of data transport at Meta. Everything from revenue-critical data to important monitoring datasets flows through the system, which sets an extremely high reliability bar. In this talk, we will take you through our journey of adding the fifth nine to our SLAs. We’ll talk about how we detect incidents before our customers do without overloading our oncalls, and how we hardened the system against regional and dependency outages.
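To give a sense of scale for the "nines" mentioned in the abstract above, the following is a minimal sketch of the arithmetic. It assumes the nines describe time-based availability over a 365-day year; the talk may well measure something different (for example, the fraction of data delivered), and the function name `allowed_downtime_seconds` is purely illustrative.

```python
def allowed_downtime_seconds(nines: int, period_seconds: float = 365 * 24 * 3600) -> float:
    """Downtime budget for an availability target with the given number of nines.

    Assumption: "N nines" means availability of 1 - 10**(-N), measured as
    wall-clock time over the period (a 365-day year by default).
    """
    unavailability = 10 ** (-nines)
    return period_seconds * unavailability


# Compare the budget before and after adding a fifth nine.
for n in (4, 5):
    minutes = allowed_downtime_seconds(n) / 60
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, four nines allow roughly 52.6 minutes of downtime per year, while five nines allow roughly 5.3 minutes, which is why each additional nine tends to demand faster detection and mitigation.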

Speaker Mohamed Bassem, Meta

Speaker Tiziano Carotti, Meta

Speaker Yuri Dolgov, Meta

FEATURED BLOG

SCRIBE: IMPROVING RELIABILITY ONE NINE AT A TIME

12:10 PM - 12:40 PM

Q&A

Moderator Laine Campbell, Meta

Speaker Ahmed Eid, Meta

Speaker Kishore Banala, Netflix

Speaker Anoop Panicker, Netflix

Speaker Mohamed Bassem, Meta

Speaker Tiziano Carotti, Meta

Speaker Yuri Dolgov, Meta

Speaker Adam Phillabaum, Meta

Speaker David Pariag, Meta

Speaker Raghavendra Prabhu, Meta

12:40 PM - 12:55 PM

Break

12:55 PM - 01:15 PM

Improving Reliability Through Data-Driven Engineering & Culture Changes

Over the past two years, we have leveraged a data-driven strategy to improve the state of system reliability across Monetization. The first step was to quantify the impact of reliability work by defining a longitudinal metric that measures the impact of reliability failures on the business. For the Advertising business at Meta, the negative impact from SEVs (system outages) can be characterized in terms of advertiser value lost (our systems cannot deliver optimal value for advertisers), which translates to short- and long-term revenue lost for the company. Teams across the company then took clear data-driven goals to reduce the negative impact on advertisers of SEVs affecting their systems. These goals were based on data mining of past SEVs, which helped identify hot spots that are actionable in terms of the four levers of reliability (prevention, detection, mitigation, culture).

Speaker David Amsallem, Meta

Speaker Gursharan Singh, Meta

FEATURED BLOG

IMPROVING RELIABILITY THROUGH DATA-DRIVEN ENGINEERING AND CULTURE CHANGES: LEARNINGS FROM MONETIZATION

01:15 PM - 01:35 PM

A Cultural Foundation of Operational Excellence: Amazon’s Mechanisms for Resiliency

Operational excellence is hard, but it's impossible without a strong culture. One of Amazon's founding ideals is operational excellence; our store wouldn't work without it, and AWS especially demands deep operational excellence. Learn how Amazon's culture of obsessive, rigorous ownership and our unrelenting focus on building mechanisms and truth-seeking undergird everything we do at AWS, including, and especially, handling events and resiliency.

Speaker Peter M. O’Donnell, AWS

01:35 PM - 01:55 PM

Building a Culture of Reliability

Building reliability engineering programs requires not only technological approaches, but also attention to the underlying culture of an organization and how it can help or hinder efforts to enhance reliability. This talk explores how and why cultural values, as part of a broader systems analysis, can help drive reliability. Specifically, this talk presents an anthropologist’s perspective on Meta’s journey from a mantra of “move fast and break things” to one where reliability is increasingly becoming an important value for our engineering community. The audience should take away actionable practices that can be used to evaluate their underlying reliability culture, take data-driven approaches to measuring reliability sentiment, and identify practices that align with existing cultural values and solidify actions to bring reliability to the fore.

Speaker Casey Bouskill, Meta

FEATURED BLOG

WHY IS A CULTURAL ANTHROPOLOGIST ON A RELIABILITY ENGINEERING TEAM?

01:55 PM - 02:15 PM

A Reliability Maturity Model Explained

In this presentation, we focus on the organization and the ethos that contribute to the reliability that the organization is capable of sustaining for its products. We describe the stages of reliability maturity that an engineering organization can traverse, explain how to determine where in the continuum of maturity your organization currently falls, and provide some ideas that can be used to improve the reliability maturity of the organization. Most importantly, we will provide some thoughts on how to determine the reliability maturity phase the organization requires and what attributes contribute to it. Not all teams need to achieve or maintain the same level of maturity, and knowing what best suits your organization is critically important.

We will also share case studies of some products at Google as they traverse this model to find a reliability equilibrium that fits their organizational and cultural norms.

Speaker Tracy Ferrell, Google

Speaker Vartika Agarwal, Google

02:15 PM - 02:45 PM

Q&A

Moderator Marina Fisher, Meta

Speaker David Amsallem, Meta

Speaker Gursharan Singh, Meta

Speaker Casey Bouskill, Meta

Speaker Tracy Ferrell, Google

Speaker Vartika Agarwal, Google

02:45 PM - 03:00 PM

Closing Remarks

SPEAKERS AND MODERATORS

Francois Richard is Engineering Director responsible for the Reliability Infra at Meta. Reliability Infra...

Francois Richard

Meta

Syamla is currently the Senior Director of Production Engineering for Products at Meta which...

Syamla Bandla

Meta

Surupa Biswas is the Vice President of Engineering responsible for Core Infrastructure at Meta,...

Surupa Biswas

Meta

Chris is part of the Infra Data Center organization where he helps define Meta’s...

Chris Malone

Meta

Jason leads the Production Engineering teams globally at Meta as well as the Security...

Jason Kalich

Meta

Ahmed is a software engineer working on teams across Meta’s Infrastructure for the past...

Ahmed Eid

Meta

Raghavendra (Raghu) D. Prabhu is a Software Engineer in the Reliability Infra org at...

Raghavendra Prabhu

Meta

Rujia Wang is a principal research PM at Microsoft, leading the research and product...

Rujia Wang

Microsoft

Adam Phillabaum is the Production Engineering Manager for Meta's Monitoring Products, ensuring our monitoring...

Adam Phillabaum

Meta

David Pariag is a software engineer focusing on Meta’s Monitoring Products. He’s spent the...

David Pariag

Meta

Kishore Banala is a Software Engineer at Netflix, and one of the core contributors...

Kishore Banala

Netflix

Anoop Panicker is a Netflix engineer, specializing in the development of large scale distributed...

Anoop Panicker

Netflix

Mohamed has been a software engineer at Meta for the past 6 years. His...

Mohamed Bassem

Meta

Tiziano is a software engineer with 15 years work experience. He is a versatile...

Tiziano Carotti

Meta

Yuri is one of the lead engineers in the Data Infrastructure team at Meta...

Yuri Dolgov

Meta

Laine is the Production Engineering Director supporting Core Data, Meta's online data stack. She...

Laine Campbell

Meta

David Amsallem is a Data Science Manager at Meta. He has been working in...

David Amsallem

Meta

Gursharan works on Ads Data Infrastructure at Meta, and also leads the Ads Reliability...

Gursharan Singh

Meta

Peter M. O’Donnell is an AWS Principal Solutions Architect, specializing in security, risk, and...

Peter M. O’Donnell

AWS

Bouskill is a senior researcher at Meta where she works on Reliability Engineering. An...

Casey Bouskill

Meta

A 10 year Google SRE, now serves as a Product Area SRE lead, has...

Tracy Ferrell

Google

A senior Technical Program Manager, now serving as lead in Cloud Security, as a...

Vartika Agarwal

Google

Marina is a Production Engineering director in Meta Enterprise Infra and Security group. With...

Marina Fisher

Meta

LATEST NOTES

Systems & Reliability @Scale

09/26/2023

Scribe: Improving Reliability One Nine at a Time

Why does Scribe need more nines? Scribe is a highly scalable, distributed, durable queue that allows writing and reading large-volume...

Systems & Reliability @Scale

09/26/2023

The Evolution of Disaster Recovery at Meta

Overview In the System @Scale 2019, Justin Meza and Shruti Padmanabha spoke about Disaster Recovery at Facebook Scale, describing varying...

Systems & Reliability @Scale

09/26/2023

Building Resilient Monitoring at Meta

Overview of Monitoring at Meta Since its inception, Meta (formerly Facebook) has invested heavily in software and hardware infrastructure. Today,...

Systems & Reliability @Scale

09/26/2023

Why is a Cultural Anthropologist on a Reliability Engineering Team?

For many of its formative years, Facebook subscribed to the motto “Move fast and break things.” I did not work...

Systems & Reliability @Scale

09/27/2023

Improving Reliability through Data-driven Engineering and Culture Changes: Learnings from Monetization

Advertising at Meta Advertising is the primary revenue-generating product for Meta, producing over $113 billion in revenue in 2022, more...
