TOPIC: Data, Systems and Networking

Reliability @Scale Summer 2022

AUGUST 31, 2022 @ 10:00 AM PDT - 3:15 PM PDT

ABOUT EVENT

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems at massive scale. Whether it’s novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

Reliability @Scale will be hosted virtually. Joining us are speakers from Akamai, Cloudflare, Fastly, Google, Meta, and Roblox. The event will be held on August 31, 2022, with talks themed around large-scale outages, incident response and learnings, and measuring reliability at scale.

EVENT AGENDA

Event times below are displayed in PT.

August 31

10:00 AM - 10:15 AM
How We Drained Every Backbone Router Simultaneously

Presented by: Santosh Janardhan (Meta)

Speaker Santosh Janardhan, Meta
10:15 AM - 10:40 AM
Lessons Learned from the Halloween Outage

In this talk, VP of Engineering Max Ross will discuss the 73-hour outage that impacted Roblox late last year. He will also share some of the ways that a multi-day outage can turn conventional reliability wisdom on its head.

Speaker Max Ross, Roblox
10:40 AM - 10:55 AM
QUIC Exit: Exposing a New Class of Outage

A crash bug in QUIC handshake code exposed a new class of bugs we termed ‘contagion bugs’. For these bugs, a tiny number of tasks can cause a huge outage and rollbacks don’t work as expected. This talk explains what contagion bugs are, walks through the details of the outage, and describes what we did to prevent and mitigate them going forward.

Speaker Ian Swett, Google
10:55 AM - 11:10 AM
Service Incident Deep Dive: Technical Overview & Learnings

This talk will provide a technical overview of a service incident on the Akamai platform in July 2021 which, despite layers of safety technologies, impacted some of Akamai’s customers. In addition to exploring the technical underpinnings of the incident, there will be discussion of lessons learned and actions taken to broadly reduce the risk of recurrence.

Speaker James Kretchmar, Akamai
11:10 AM - 11:25 AM
Lessons From Long-Running Investigations

In this talk, we share some lessons from several of our long-running investigations. Some are well known but worth repeating; others are things we learned and want to share.

Speaker Jana Iyengar, Fastly
Speaker Hossein Lotfi, Fastly
11:25 AM - 11:45 AM
AWS Infrastructure: Engineering for Resiliency at Scale

Presented by: Prasad Kalyanaraman (AWS)

Speaker Prasad Kalyanaraman, AWS
11:45 AM - 12:00 PM
Pipefail Overview and Discussion

Presented by: Jeremy Hartman (Cloudflare)

Speaker Jeremy Hartman, Cloudflare
12:00 PM - 12:45 PM
Live Panel

All Panelists + Moderated by Anca Agape (Meta)

Speaker Anca Agape, Meta
12:45 PM - 12:55 PM
Break
12:55 PM - 01:15 PM
Improving Reliability @ Meta: By Analyzing Historical Events That Led to SLO Violations

Learn about the culture of tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs) at Instagram specifically and Meta in general, the tools that we use, and how teams' SLI/SLO workflows can be improved by annotating SLO violations and analyzing them later. In the talk, we will briefly cover the history of SLI/SLO tracking at Meta, then talk about how the Instagram team used data annotations to tackle some of the reliability issues they had and how we're expanding this approach to the whole company.

Speaker Kostiantyn Tsaregradskyi, Meta
Speaker Keshav Varma, Meta
01:15 PM - 01:40 PM
Service Degradation at Scale: Creating Instant Capacity

We will talk about what factors made us realize that service degradation is necessary for our infrastructure and the challenges we faced while implementing service degradation at scale. We will also speak about how we are changing our Fault Tolerance Strategy to use service degradation instead of provisioning extra buffer.

Speaker Thote Gowda, Meta
Speaker Yi Yu, Meta
01:40 PM - 02:10 PM
Shrinking the Impact of Production Incidents

This talk details an organized approach for reducing the overall impact of production outages.

Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem data, focusing on tasks that:

  • Reduce time to detection of the incident
  • Shorten the time to repair
  • Expand the time between failures
Speaker Yuri Grinshteyn, Google
02:10 PM - 02:30 PM
Reliably Changing Configuration @ Scale

Thousands of services at Meta use Configuration Management, so it is important we change that configuration reliably. Tune in for a story spanning several years, covering how we exponentially grew coverage of a protection mechanism for one of our most critical developer workflows. Along the way, we'll dive into some specifics of challenges we faced and overcame to reliably change configuration at scale.

Speaker Avery Berchek, Meta
02:30 PM - 02:45 PM
Meta's SEV Culture: How Today's SEVs Create Tomorrow's Reliability

Would you believe us if we said the more SEVs we have, the more reliable we are? In this talk, we'll discuss why we love SEVs at Meta and how our culture around SEVs has allowed us to build reliable services at scale. We'll start by exploring research from other industries about how incident culture shapes how reliable they are. We'll then share how we've applied these lessons to our own culture. Along the way, we'll give a peek at our SEV tool, some insight into our SEV review process, and describe how we encourage a "culture of SEVs" from the very first day an engineer arrives at Meta.

Speaker Joe Gasperetti, Meta
Speaker Nick Egebo, Meta
02:45 PM - 03:15 PM
Live Q&A Session

All Speakers + Moderated by Christian Monzon (Meta)

SPEAKERS AND MODERATORS

Santosh Janardhan is the head of infrastructure at Meta, where he supports the teams...

Santosh Janardhan

Meta

As Vice President of Engineering, Max Ross leads the development and operation of the...

Max Ross

Roblox

Ian Swett is the Manager of Google Cloud Networking's Protocols and Web Performance teams....

Ian Swett

Google

James Kretchmar is Vice President and CTO of Akamai's Edge Technology Group, responsible for...

James Kretchmar

Akamai

Jana Iyengar is the Product Lead for Infrastructure Services at Fastly, where he is...

Jana Iyengar

Fastly

Hossein is VP of Engineering at Fastly, where he leads Network Systems, an organization...

Hossein Lotfi

Fastly

Prasad Kalyanaraman has been with Amazon for over 17 years. He leads the AWS...

Prasad Kalyanaraman

AWS

Jeremy Hartman is currently serving as Senior Vice President of Production Engineering at Cloudflare....

Jeremy Hartman

Cloudflare

Anca Agape is a software engineer working on teams across Meta’s Infrastructure for the...

Anca Agape

Meta

I enjoy working on everything web-related. Have been doing that since 2008, still have...

Kostiantyn Tsaregradskyi

Meta

Keshav is a Production Engineer at Instagram and is passionate about building reliable infrastructure....

Keshav Varma

Meta

Thote Gowda has been working with Meta for close to 4 years. He has...

Thote Gowda

Meta

With DevOps and System Admin background, Yi Yu joined the Disaster Recovery team at...

Yi Yu

Meta

Yuri Grinshteyn strongly believes that reliability is a key feature of any service and...

Yuri Grinshteyn

Google

I'm Avery, a Production Engineer on the Configuration Management team at Meta. I've been...

Avery Berchek

Meta

Joe is currently a production engineer on Meta’s Reliability Engineering initiative. Over the last...

Joe Gasperetti

Meta

Nick joined Meta in 2018 as a Production Engineering manager. He is responsible for...

Nick Egebo

Meta
