Reliability @Scale Summer 2022

AUGUST 31, 2022 @ 10:00 AM PDT - 3:15 PM PDT

Designed for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.

ABOUT EVENT

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems at massive scale. Whether it’s novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

Reliability @Scale will be hosted virtually. Joining us are speakers from Akamai, Cloudflare, Fastly, Google, Meta, and Roblox. The event will be hosted on August 31, 2022 with talks themed around large-scale outages, incident response and learnings, and measuring reliability at scale.

EVENT AGENDA

Event times below are displayed in PT.

August 31

10:00 AM - 10:15 AM

How We Drained Every Backbone Router Simultaneously

Presented by: Santosh Janardhan (Meta)

Speaker Santosh Janardhan,Meta

10:15 AM - 10:40 AM

Lessons Learned from the Halloween Outage

In this talk, VP of Engineering Max Ross will discuss the 73-hour outage that impacted Roblox late last year. He will also share some of the ways that a multi-day outage can turn conventional reliability wisdom on its head.

Speaker Max Ross,Roblox

10:40 AM - 10:55 AM

QUIC Exit: Exposing a New Class of Outage

A crash bug in QUIC handshake code exposed a new class of bugs we termed ‘contagion bugs’. For these bugs, a tiny number of tasks can cause a huge outage and rollbacks don’t work as expected. This talk explains what contagion bugs are, walks through the details of the outage, and covers what we did to prevent and mitigate them going forward.

Speaker Ian Swett,Google

10:55 AM - 11:10 AM

Service Incident Deep Dive: Technical Overview & Learnings

This talk will provide a technical overview of a service incident on the Akamai platform in July 2021 which, despite layers of safety technologies, nevertheless impacted some of Akamai’s customers. In addition to exploring the technical underpinnings of the incident, there will be discussion of lessons learned and actions taken to broadly reduce the risk of recurrence.

Speaker James Kretchmar,Akamai

11:10 AM - 11:25 AM

Lessons From Long-Running Investigations

In this talk, we share some lessons from several of our long-running investigations. Some of them are well-known, but are worth repeating, and some of them are things we learned and want to share.

Speaker Jana Iyengar,Fastly

Speaker Hossein Lotfi,Fastly

11:25 AM - 11:45 AM

AWS Infrastructure: Engineering for Resiliency at Scale

Presented by: Prasad Kalyanaraman (AWS)

Speaker Prasad Kalyanaraman,AWS

11:45 AM - 12:00 PM

Pipefail Overview and Discussion

Presented by: Jeremy Hartman (Cloudflare)

Speaker Jeremy Hartman,Cloudflare

12:00 PM - 12:45 PM

Live Panel

All Panelists + Moderated by Anca Agape (Meta)

Speaker Anca Agape,Meta

12:45 PM - 12:55 PM

Break

12:55 PM - 01:15 PM

Improving Reliability @ Meta: By Analyzing Historical Events That Led to SLO Violations

Learn about the culture of tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs) at Instagram specifically and Meta in general, the tools that we use, and how teams' SLI/SLO workflows can be improved by annotating SLO violations and analyzing them later. In the talk we will briefly cover the history of SLI/SLO tracking at Meta, then talk about how the Instagram team used data annotations to tackle some of the reliability issues they had and how we're expanding this approach to the whole company.
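
As a rough illustration of what annotating SLO violations for later analysis could look like, here is a minimal Python sketch. It does not describe the tooling presented in the talk: the `Window` record, the 99.9% target, the cause labels, and the sample numbers are all assumptions made up for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

SLO_TARGET = 0.999  # assumed objective, not a figure from the talk


@dataclass
class Window:
    # Hypothetical daily measurement window; field names are illustrative.
    day: str
    good: int                          # requests that met the SLI criteria
    total: int                         # all requests in the window
    annotation: Optional[str] = None   # cause label added by an engineer

    @property
    def sli(self) -> float:
        return self.good / self.total


def violations(windows, target=SLO_TARGET):
    """Windows whose measured SLI fell below the objective."""
    return [w for w in windows if w.sli < target]


def causes(windows, target=SLO_TARGET):
    """Count violating windows per annotated cause."""
    return Counter(w.annotation or "unannotated" for w in violations(windows, target))


# Made-up sample data for illustration only.
windows = [
    Window("2022-08-01", 99950, 100000),
    Window("2022-08-02", 99800, 100000, "bad config push"),
    Window("2022-08-03", 99000, 100000, "dependency outage"),
    Window("2022-08-04", 99850, 100000, "bad config push"),
    Window("2022-08-05", 99890, 100000),
]
summary = causes(windows)
```

Grouping violations by annotation makes recurring causes visible, and the "unannotated" bucket shows how much of the history still needs a label.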

Speaker Kostiantyn Tsaregradskyi,Meta

Speaker Keshav Varma,Meta

01:15 PM - 01:40 PM

Service Degradation at Scale: Creating Instant Capacity

We will talk about what factors made us realize that service degradation is necessary for our infrastructure and the challenges we faced while implementing service degradation at scale. We will also speak about how we are changing our Fault Tolerance Strategy to use service degradation instead of provisioning extra buffer.

Speaker Thote Gowda,Meta

Speaker Yi Yu,Meta

01:40 PM - 02:10 PM

Shrinking the Impact of Production Incidents

This talk details an organized approach for reducing the overall impact of production outages.

Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem data, focusing on tasks that:

Reduce time to detection of the incident
Shorten the time to repair
Expand the time between failures
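
The three levers listed above can be made concrete with a small calculation. The Python sketch below is only an illustration under stated assumptions: the `Incident` record, its field names, and the sample timestamps are hypothetical and are not taken from the talk. It computes mean time to detection, mean time to repair, and mean time between failures, then estimates the fraction of time spent in an incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical postmortem record; field names are illustrative only.
    started: datetime
    detected: datetime
    resolved: datetime


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def summarize(incidents):
    """Return mean TTD, TTR, and TBF in minutes for a list of incidents."""
    ordered = sorted(incidents, key=lambda i: i.started)
    ttd = mean(_minutes(i.detected - i.started) for i in ordered)
    ttr = mean(_minutes(i.resolved - i.detected) for i in ordered)
    # Time between failures: gap from one resolution to the next start.
    gaps = [_minutes(nxt.started - cur.resolved)
            for cur, nxt in zip(ordered, ordered[1:])]
    tbf = mean(gaps) if gaps else float("nan")
    return {"ttd_min": ttd, "ttr_min": ttr, "tbf_min": tbf}


def impacted_fraction(stats):
    """Rough share of time spent in an incident: (TTD+TTR) / (TTD+TTR+TBF)."""
    impact = stats["ttd_min"] + stats["ttr_min"]
    return impact / (impact + stats["tbf_min"])


# Made-up sample data for illustration only.
t0 = datetime(2022, 1, 1)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(t0 + timedelta(minutes=1000),
             t0 + timedelta(minutes=1020),
             t0 + timedelta(minutes=1080)),
]
stats = summarize(incidents)
fraction = impacted_fraction(stats)
```

Under this simple model, lowering TTD or TTR shrinks the numerator, while raising TBF grows the denominator, so each lever reduces the impacted fraction.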

Speaker Yuri Grinshteyn,Google

02:10 PM - 02:30 PM

Reliably Changing Configuration @ Scale

Thousands of services at Meta use Configuration Management, so it is important we change that configuration reliably. Tune in for a story spanning several years, covering how we exponentially grew coverage of a protection mechanism for one of our most critical developer workflows. Along the way, we'll dive into some specifics of challenges we faced and overcame to reliably change configuration at scale.

Speaker Avery Berchek,Meta

02:30 PM - 02:45 PM

Meta's SEV Culture: How Today's SEVs Create Tomorrow's Reliability

Would you believe us if we said the more SEVs we have, the more reliable we are? In this talk we'll cover the reasons why we love SEVs at Meta, and how our culture around SEVs has allowed us to build reliable services at scale. We'll start by exploring research from other industries about how incident culture shapes their reliability. We'll then share how we've applied these lessons to our own culture. Along the way we'll give a peek at our SEV tool, some insight into our SEV review process, and describe how we encourage a "culture of SEVs" from the very first day an engineer arrives at Meta.

Speaker Joe Gasperetti,Meta

Speaker Nick Egebo,Meta

02:45 PM - 03:15 PM

Live Q&A Session

All Speakers + Moderated by Christian Monzon (Meta)

SPEAKERS AND MODERATORS

Santosh Janardhan is the head of infrastructure at Meta, where he supports the teams...

Santosh Janardhan

Meta

As Vice President of Engineering, Max Ross leads the development and operation of the...

Max Ross

Roblox

Ian Swett is the Manager of Google Cloud Networking's Protocols and Web Performance teams....

Ian Swett

Google

James Kretchmar is Vice President and CTO of Akamai's Edge Technology Group, responsible for...

James Kretchmar

Akamai

Jana Iyengar is the Product Lead for Infrastructure Services at Fastly, where he is...

Jana Iyengar

Fastly

Hossein is VP of Engineering at Fastly, where he leads Network Systems, an organization...

Hossein Lotfi

Fastly

Prasad Kalyanaraman has been with Amazon for over 17 years. He leads the AWS...

Prasad Kalyanaraman

AWS

Jeremy Hartman is currently serving as Senior Vice President of Production Engineering at Cloudflare....

Jeremy Hartman

Cloudflare

Anca is a seasoned software engineer with over 11 years of experience at Meta,...

Anca Agape

Meta

I enjoy working on everything web-related. Have been doing that since 2008, still have...

Kostiantyn Tsaregradskyi

Meta

Keshav is a Production Engineer at Instagram and is passionate about building reliable infrastructure....

Keshav Varma

Meta

Thote Gowda has been working with Meta for close to 4 years. He has...

Thote Gowda

Meta

With DevOps and System Admin background, Yi Yu joined the Disaster Recovery team at...

Yi Yu

Meta

Yuri Grinshteyn strongly believes that reliability is a key feature of any service and...

Yuri Grinshteyn

Google

I'm Avery, a Production Engineer on the Configuration Management team at Meta. I've been...

Avery Berchek

Meta

Joe is currently a production engineer on Meta’s Reliability Engineering initiative. Over the last...

Joe Gasperetti

Meta

Nick joined Meta in 2018 as a Production Engineering manager. He is responsible for...

Nick Egebo

Meta

LATEST NOTES

@Scale engineers pencil blogs, articles, and academic papers to further inform and inspire the engineering community.

Systems & Reliability @Scale

08/31/2022

Reliably Changing Configuration @ Scale

Thousands of services at Meta utilize Configuration Management. Because of this, changing configuration reliably is essential. In this post, I...
