Data, Systems and Networking 9/27/2023

Scribe: Improving Reliability One 9 at a Time

Scribe is at the heart of data transport at Meta. Everything from revenue-critical data to important monitoring datasets flows through the system. This sets an extremely high reliability bar for the system. In this talk, we will take you through our journey of adding the 5th 9 to our SLAs. We’ll talk about how we […]
WATCH NOW

TRENDING POSTS

8/16/2023
How Meta’s Reels APIs is Empowering the Passion Economy
READ MORE
7/11/2023
ServiceRouter: Hyperscale Service Mesh at Meta
READ MORE

Data, Systems and Networking 9/27/2023
Closing Remarks
WATCH NOW
Data, Systems and Networking 9/27/2023
Live Q&A with Speakers
WATCH NOW
Data, Systems and Networking 9/27/2023
A Reliability Maturity Model Explained
In this presentation, we focus on the organization and the ethos that contribute to the reliability that the organization is capable of sustaining for its products. We describe the stages of reliability maturity that an engineering organization can traverse, show how to determine where on the continuum of maturity your organization currently falls, and provide some ideas […]
WATCH NOW
Data, Systems and Networking 9/27/2023
Building a Culture of Reliability
Building reliability engineering programs requires not only technological approaches, but also attention to the underlying culture of an organization and how it can help or hinder efforts to enhance reliability. This talk explores how and why cultural values, as part of a broader systems analysis, can help drive reliability. Specifically, this talk presents an anthropologist’s […]
WATCH NOW
Data, Systems and Networking 9/27/2023
A Cultural Foundation of Operational Excellence: Amazon’s Mechanisms for Resiliency
Operational excellence is hard but it’s impossible without a strong culture. One of Amazon’s founding ideals is operational excellence; our store wouldn’t work without it and AWS especially demands deep operational excellence. Learn about how Amazon’s culture of obsessive, rigorous ownership and our unrelenting focus on building mechanisms & truth-seeking undergird everything we do at […]
WATCH NOW
Data, Systems and Networking 9/27/2023
Improving Reliability Through Data-Driven Engineering & Culture Changes
In the past 2 years, we have leveraged a data-driven strategy to improve the state of system reliability across Monetization. The first step was to quantify the impact of reliability work by defining a longitudinal metric that measures the impact of reliability failures on the business. For the Advertising business at Meta, the negative impact […]
WATCH NOW
Data, Systems and Networking 9/27/2023
Live Q&A with Speakers
WATCH NOW
Data, Systems and Networking 9/27/2023
Migrations at Scale: Learnings & Patterns from Zero-Downtime Migrations
Zero-downtime migrations have become an indispensable part of modern software engineering, enabling organizations to smoothly transition complex systems while ensuring uninterrupted operations. In this session, we will deep dive into the intricacies of zero-downtime migrations, delving into the “what, why, and how” behind these seamless transitions at scale.
WATCH NOW
Data, Systems and Networking 9/27/2023
Building Resilient Monitoring at Meta
Meta’s monitoring infrastructure is responsible for monitoring the health of thousands of systems deployed on millions of heterogeneous, geographically distributed hosts. Monitoring the health of Meta’s infrastructure is crucial to both our users and our business. And, monitoring is especially important during widespread failures. This talk explains the journey of hardening Meta’s monitoring systems to […]
WATCH NOW
Data, Systems and Networking 9/27/2023
Large Language Models for Automatic Cloud Incident Management
Building reliable hyper-scale cloud services can be challenging. We need to quickly detect, analyze, and mitigate incidents, tasks that largely rely on human effort today. Recent breakthroughs in Large Language Models (LLMs) have motivated us to explore their potential for automated incident diagnosis. By leveraging LLMs, we aim to accelerate the incident resolution process, leading to improved […]
WATCH NOW
Data, Systems and Networking 9/27/2023
Evolution of Disaster Recovery @ Meta
Disaster Recovery (DR) is our program to prepare Meta’s infrastructure to handle capacity outages. In this talk, we present the story of how the DR program has evolved, from handling single-region failures to new risks on the horizon, like power and multi-region outages. Also, DR strategies such as Power Storms and Site Degradation towards […]
WATCH NOW
