TOPIC: Data, Systems and Networking

Reliability @Scale Summer 2022

AUGUST 31, 2022 @ 10:00 AM - AUGUST 31, 2022 @ 02:30 PM PT

ABOUT EVENT

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems at massive scale. Whether it’s novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

Reliability @Scale will be hosted virtually. Joining us are speakers from Akamai, AWS, Cloudflare, Fastly, Google, Meta, and Roblox. The event takes place on August 31, 2022, with talks themed around large-scale outages, incident response and learnings, and measuring reliability at scale.

EVENT AGENDA

Event times below are displayed in PT.

August 31

10:00 AM - 10:15 AM
How We Drained Every Backbone Router Simultaneously

Presented by: Santosh Janardhan (Meta)

SPEAKER Santosh Janardhan,Meta
10:15 AM - 10:40 AM
Lessons Learned from the Halloween Outage

In this talk, VP of Engineering Max Ross will discuss the 73-hour outage that impacted Roblox late last year. He will also share some of the ways that a multi-day outage can turn conventional reliability wisdom on its head.

SPEAKER Max Ross,Roblox
10:40 AM - 10:55 AM
QUIC Exit: Exposing a New Class of Outage

A crash bug in QUIC handshake code exposed a new class of bugs we termed ‘contagion bugs’. For these bugs, a tiny number of tasks can cause a huge outage, and rollbacks don’t work as expected. This talk explains what contagion bugs are, walks through the details of the outage, and describes what we did to prevent and mitigate such bugs going forward.

SPEAKER Ian Swett,Google
10:55 AM - 11:10 AM
Service Incident Deep Dive: Technical Overview & Learnings

This talk will provide a technical overview of a service incident on the Akamai platform in July 2021 which, despite layers of safety technologies, nevertheless impacted some of Akamai’s customers. In addition to exploring the technical underpinnings of the incident, there will be discussion of lessons learned and actions taken to broadly reduce the risk of recurrence.

SPEAKER James Kretchmar,Akamai
11:10 AM - 11:25 AM
Lessons From Long-Running Investigations

In this talk, we share some lessons from several of our long-running investigations. Some of them are well-known, but are worth repeating, and some of them are things we learned and want to share.

SPEAKER Jana Iyengar,Fastly
SPEAKER Hossein Lotfi,Fastly
11:25 AM - 11:45 AM
AWS Infrastructure: Engineering for Resiliency at Scale

Presented by: Prasad Kalyanaraman (AWS)

SPEAKER Prasad Kalyanaraman,AWS
11:45 AM - 12:00 PM
Pipefail Overview and Discussion

Presented by: Jeremy Hartman (Cloudflare)

SPEAKER Jeremy Hartman,Cloudflare
12:00 PM - 12:45 PM
Live Panel

All Panelists + Moderated by Anca Agape (Meta)

SPEAKER Anca Agape,Meta
12:45 PM - 12:55 PM
Break
12:55 PM - 01:15 PM
Improving Reliability @ Meta: By Analyzing Historical Events That Led to SLO Violations

Learn about the culture of tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs) at Instagram specifically and Meta in general, the tools that we use, and how teams' SLI/SLO workflows can be improved by annotating SLO violations and analyzing them later. In the talk we will briefly cover the history of SLI/SLO tracking at Meta, then talk about how the Instagram team used data annotations to tackle some of its reliability issues and how we're expanding this approach to the whole company.

SPEAKER Kostiantyn Tsaregradskyi,Meta
SPEAKER Keshav Varma,Meta
01:15 PM - 01:40 PM
Service Degradation at Scale: Creating Instant Capacity

We will talk about what factors made us realize that service degradation is necessary for our infrastructure and the challenges we faced while implementing service degradation at scale. We will also speak about how we are changing our Fault Tolerance Strategy to use service degradation instead of provisioning extra buffer.

SPEAKER Thote Gowda,Meta
SPEAKER Yi Yu,Meta
01:40 PM - 02:10 PM
Shrinking the Impact of Production Incidents

This talk details an organized approach for reducing the overall impact of production outages.

Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem data, focusing on tasks that:

  • Reduce time to detection of the incident
  • Shorten the time to repair
  • Expand the time between failures
SPEAKER Yuri Grinshteyn,Google
02:10 PM - 02:30 PM
Reliably Changing Configuration @ Scale

Thousands of services at Meta use Configuration Management, so it is important that we change that configuration reliably. Tune in for a story spanning several years, covering how we exponentially grew coverage of a protection mechanism for one of our most critical developer workflows. Along the way, we'll dive into some specifics of the challenges we faced and overcame to reliably change configuration at scale.

SPEAKER Avery Berchek,Meta
02:30 PM - 02:45 PM
Meta's SEV Culture: How Today's SEVs Create Tomorrow's Reliability

Would you believe us if we said the more SEVs we have, the more reliable we are? In this talk we'll discuss why we love SEVs at Meta, and how our culture around SEVs has allowed us to build reliable services at scale. We'll start by exploring research from other industries about how incident culture shapes reliability. We'll then share how we've applied these lessons to our own culture. Along the way we'll give a peek at our SEV tool, offer some insight into our SEV review process, and describe how we encourage a "culture of SEVs" from the very first day an engineer arrives at Meta.

SPEAKER Joe Gasperetti,Meta
SPEAKER Nick Egebo,Meta
02:45 PM - 03:15 PM
Live Q&A Session

All Speakers + Moderated by Christian Monzon (Meta)

SPEAKERS AND MODERATORS

Santosh Janardhan is the Vice President of Infrastructure at Meta.

Santosh Janardhan

Meta

As Vice President of Engineering, Max Ross leads the development and operation of the distributed computing platform that supports back-end...

Max Ross

Roblox

Ian Swett is the Manager of Google Cloud Networking's Protocols and Web Performance teams. Ian was heavily involved in the...

Ian Swett

Google

James Kretchmar is Vice President and CTO of Akamai's Edge Technology Group, responsible for the architecture and technology strategy of...

James Kretchmar

Akamai

Jana Iyengar is the Product Lead for Infrastructure Services at Fastly, where he is responsible for the core hardware, software,...

Jana Iyengar

Fastly

Hossein is VP of Engineering at Fastly, where he leads Network Systems, an organization responsible for building reliable, cost-effective, and...

Hossein Lotfi

Fastly

Prasad Kalyanaraman has been with Amazon for over 17 years. He leads the AWS Infrastructure Services organization, that is responsible...

Prasad Kalyanaraman

AWS

Jeremy Hartman is currently serving as Senior Vice President of Production Engineering at Cloudflare. He is responsible for overall availability...

Jeremy Hartman

Cloudflare

Anca Agape is a software engineer working on teams across Meta’s Infrastructure for the past 9 years. She is currently...

Anca Agape

Meta

I enjoy working on everything web-related. Have been doing that since 2008, still have flashbacks from debugging JavaScript code with...

Kostiantyn Tsaregradskyi

Meta

Keshav is a Production Engineer at Instagram and is passionate about building reliable infrastructure. He previously worked with the Cassandra...

Keshav Varma

Meta

Thote Gowda has been working with Meta for close to 4 years. He has a keen interest in working on...

Thote Gowda

Meta

With DevOps and System Admin background, Yi Yu joined the Disaster Recovery team at Meta as a Production Engineer in...

Yi Yu

Meta

Yuri Grinshteyn strongly believes that reliability is a key feature of any service and works to advocate for Site Reliability...

Yuri Grinshteyn

Google

I'm Avery, a Production Engineer on the Configuration Management team at Meta. I've been here four years, and while I've...

Avery Berchek

Meta

Joe is currently a production engineer on Meta’s Reliability Engineering initiative. Over the last decade at Meta, he’s been an...

Joe Gasperetti

Meta

Nick joined Meta in 2018 as a Production Engineering manager. He is responsible for the infrastructure that powers messaging services...

Nick Egebo

Meta

LATEST NOTES

@Scale engineers pencil blogs, articles, and academic papers to further inform and inspire the engineering community.

Reliability @Scale
08/31/2021
Reliably Changing Configuration @ Scale
Thousands of services at Meta utilize Configuration Management. Because of this, changing configuration reliably is essential. In this post, I...
