Reliability @Scale Summer 2022

AUGUST 31, 2022 @ 10:00 AM PDT - 3:15 PM PDT

Designed for engineers and technologists who specialize in, and are interested in, how information moves and flows through products.

ABOUT EVENT

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems at massive scale. Whether it’s novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

Reliability @Scale will be hosted virtually. Joining us are speakers from Akamai, AWS, Cloudflare, Fastly, Google, Meta, and Roblox. The event takes place on August 31, 2022, with talks themed around large-scale outages, incident response and learnings, and measuring reliability at scale.

EVENT AGENDA

Event times below are displayed in PT.

August 31

10:00 AM - 10:15 AM

How We Drained Every Backbone Router Simultaneously

Presented by: Santosh Janardhan (Meta)

Speaker Santosh Janardhan,Meta

10:15 AM - 10:40 AM

Lessons Learned from the Halloween Outage

In this talk, VP of Engineering Max Ross will discuss the 73-hour outage that impacted Roblox late last year. He will also share some of the ways that a multi-day outage can turn conventional reliability wisdom on its head.

Speaker Max Ross,Roblox

10:40 AM - 10:55 AM

QUIC Exit: Exposing a New Class of Outage

A crash bug in QUIC handshake code exposed a new class of bugs we termed ‘contagion bugs’. With these bugs, a tiny number of tasks can cause a huge outage, and rollbacks don’t work as expected. This talk explains what contagion bugs are, walks through the details of the outage, and describes what we did to prevent and mitigate them going forward.

Speaker Ian Swett,Google

10:55 AM - 11:10 AM

Service Incident Deep Dive: Technical Overview & Learnings

This talk will provide a technical overview of a July 2021 service incident on the Akamai platform that, despite layers of safety technologies, impacted some of Akamai’s customers. In addition to exploring the technical underpinnings of the incident, there will be discussion of lessons learned and actions taken to broadly reduce the risk of recurrence.

Speaker James Kretchmar,Akamai

11:10 AM - 11:25 AM

Lessons From Long-Running Investigations

In this talk, we share some lessons from several of our long-running investigations. Some of them are well-known, but are worth repeating, and some of them are things we learned and want to share.

Speaker Jana Iyengar,Fastly

Speaker Hossein Lotfi,Fastly

11:25 AM - 11:45 AM

AWS Infrastructure: Engineering for Resiliency at Scale

Presented by: Prasad Kalyanaraman (AWS)

Speaker Prasad Kalyanaraman,AWS

11:45 AM - 12:00 PM

Pipefail Overview and Discussion

Presented by: Jeremy Hartman (Cloudflare)

Speaker Jeremy Hartman,Cloudflare

12:00 PM - 12:45 PM

Live Panel

All Panelists + Moderated by Anca Agape (Meta)

Speaker Anca Agape,Meta

12:45 PM - 12:55 PM

Break

12:55 PM - 01:15 PM

Improving Reliability @ Meta: By Analyzing Historical Events That Led to SLO Violations

Learn about the culture of tracking Service Level Indicators/Service Level Objectives at Instagram specifically and Meta in general, the tools that we use, and how teams' SLI/SLO workflows can be improved by annotating SLO violations and analyzing them later. In the talk we will briefly cover the history of SLI/SLO tracking at Meta, then talk about how the Instagram team used data annotations to tackle some of the reliability issues they had and how we're expanding this approach to the whole company.

Speaker Kostiantyn Tsaregradskyi,Meta

Speaker Keshav Varma,Meta

01:15 PM - 01:40 PM

Service Degradation at Scale: Creating Instant Capacity

We will talk about what factors made us realize that service degradation is necessary for our infrastructure and the challenges we faced while implementing service degradation at scale. We will also speak about how we are changing our Fault Tolerance Strategy to use service degradation instead of provisioning extra buffer.

Speaker Thote Gowda,Meta

Speaker Yi Yu,Meta

01:40 PM - 02:10 PM

Shrinking the Impact of Production Incidents

This talk details an organized approach for reducing the overall impact of production outages.

Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem data, focusing on tasks that:

Reduce time to detection of the incident
Shorten the time to repair
Expand the time between failures
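
The three goals above correspond to metrics that can be computed from postmortem timestamps. The following is a minimal sketch under stated assumptions, not the method presented in the talk: the incident records are hypothetical, time to repair is assumed to run from detection to resolution, and time between failures is assumed to run from one incident's resolution to the next incident's start.

```python
from datetime import datetime
from statistics import mean

# Hypothetical postmortem records: (started, detected, resolved).
INCIDENTS = [
    (datetime(2022, 1, 3, 9, 0), datetime(2022, 1, 3, 9, 20), datetime(2022, 1, 3, 11, 0)),
    (datetime(2022, 2, 14, 22, 0), datetime(2022, 2, 14, 22, 5), datetime(2022, 2, 14, 23, 0)),
    (datetime(2022, 4, 1, 6, 0), datetime(2022, 4, 1, 6, 35), datetime(2022, 4, 1, 7, 0)),
]


def mean_minutes(deltas):
    """Average a sequence of timedeltas, expressed in minutes."""
    return mean(d.total_seconds() / 60 for d in deltas)


def incident_metrics(incidents):
    """Compute mean time to detect, to repair, and between failures."""
    incidents = sorted(incidents)  # order by start time
    # Time to detection: incident start until someone (or something) notices.
    ttd = [detected - started for started, detected, _ in incidents]
    # Time to repair (assumed definition): detection until resolution.
    ttr = [resolved - detected for _, detected, resolved in incidents]
    # Time between failures (assumed definition): resolution until next start.
    tbf = [nxt[0] - cur[2] for cur, nxt in zip(incidents, incidents[1:])]
    return {
        "mean_time_to_detect_min": mean_minutes(ttd),
        "mean_time_to_repair_min": mean_minutes(ttr),
        "mean_time_between_failures_hours": mean_minutes(tbf) / 60 if tbf else None,
    }


metrics = incident_metrics(INCIDENTS)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Comparing these numbers before and after a reliability project gives one rough way to judge which of the three areas a given task actually improved.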

Speaker Yuri Grinshteyn,Google

02:10 PM - 02:30 PM

Reliably Changing Configuration @ Scale

Thousands of services at Meta use Configuration Management, so it is important we change that configuration reliably. Tune in for a story spanning several years, covering how we exponentially grew coverage of a protection mechanism for one of our most critical developer workflows. Along the way, we'll dive into some specifics of challenges we faced and overcame to reliably change configuration at scale.

Speaker Avery Berchek,Meta

02:30 PM - 02:45 PM

Meta's SEV Culture: How Today's SEVs Create Tomorrow's Reliability

Would you believe us if we said the more SEVs we have, the more reliable we are? In this talk we'll discuss the reasons why we love SEVs at Meta, and how our culture around SEVs has allowed us to build reliable services at scale. We'll start by exploring research from other industries about how incident culture shapes how reliable they are. We'll then share how we've applied these lessons to our own culture. Along the way we'll give a peek at our SEV tool, some insight into our SEV review process, and describe how we encourage a "culture of SEVs" from the very first day an engineer arrives at Meta.

Speaker Joe Gasperetti,Meta

Speaker Nick Egebo,Meta

02:45 PM - 03:15 PM

Live Q&A Session

All Speakers + Moderated by Christian Monzon (Meta)

SPEAKERS AND MODERATORS

Santosh Janardhan is the head of infrastructure at Meta, where he supports the teams...

Santosh Janardhan

Meta

As Vice President of Engineering, Max Ross leads the development and operation of the...

Max Ross

Roblox

Ian Swett is the Manager of Google Cloud Networking's Protocols and Web Performance teams....

Ian Swett

Google

James Kretchmar is Vice President and CTO of Akamai's Edge Technology Group, responsible for...

James Kretchmar

Akamai

Jana Iyengar is the Product Lead for Infrastructure Services at Fastly, where he is...

Jana Iyengar

Fastly

Hossein is VP of Engineering at Fastly, where he leads Network Systems, an organization...

Hossein Lotfi

Fastly

Prasad Kalyanaraman has been with Amazon for over 17 years. He leads the AWS...

Prasad Kalyanaraman

AWS

Jeremy Hartman is currently serving as Senior Vice President of Production Engineering at Cloudflare....

Jeremy Hartman

Cloudflare

Anca is a seasoned software engineer with over 11 years of experience at Meta,...

Anca Agape

Meta

I enjoy working on everything web-related. Have been doing that since 2008, still have...

Kostiantyn Tsaregradskyi

Meta

Keshav is a Production Engineer at Instagram and is passionate about building reliable infrastructure....

Keshav Varma

Meta

Thote Gowda has been working with Meta for close to 4 years. He has...

Thote Gowda

Meta

With DevOps and System Admin background, Yi Yu joined the Disaster Recovery team at...

Yi Yu

Meta

Yuri Grinshteyn strongly believes that reliability is a key feature of any service and...

Yuri Grinshteyn

Google

I'm Avery, a Production Engineer on the Configuration Management team at Meta. I've been...

Avery Berchek

Meta

Joe is currently a production engineer on Meta’s Reliability Engineering initiative. Over the last...

Joe Gasperetti

Meta

Nick joined Meta in 2018 as a Production Engineering manager. He is responsible for...

Nick Egebo

Meta

LATEST NOTES

@Scale engineers write blogs, articles, and academic papers to further inform and inspire the engineering community.

Systems & Reliability @Scale

08/31/2022

Reliably Changing Configuration @ Scale

Thousands of services at Meta utilize Configuration Management. Because of this, changing configuration reliably is essential. In this post, I...

