Building Resilient Monitoring at Meta

Adam Phillabaum

David Pariag

TOPIC: Systems and Networking

@SCALE SERIES: Systems and Networking

TYPE: video

YEAR: 2023

Meta’s monitoring infrastructure tracks the health of thousands of systems deployed on millions of heterogeneous, geographically distributed hosts. Monitoring the health of Meta’s infrastructure is crucial to both our users and our business, and it matters most during widespread failures. This talk explains the journey of hardening Meta’s monitoring systems to be among our most resilient infrastructure: available when most other systems are degraded (and when we most need monitoring). The session covers both the technical strategies (e.g., workload isolation, graceful degradation) and the cultural and process strategies (e.g., meetings, tracking) that we used to improve the resiliency of monitoring systems at Meta.

Outline:

Overview of Monitoring @ Meta
When Monitoring Fails: What we learned from the Facebook 2021 outage
Building Culture: Resiliency as a core value
Building Resilient Systems: Know thy enemy
Iterative Improvement: Measuring resilience
