Building Resilient Monitoring at Meta

Meta’s monitoring infrastructure is responsible for monitoring the health of thousands of systems deployed on millions of heterogeneous, geographically distributed hosts. Monitoring the health of Meta’s infrastructure is crucial to both our users and our business. And, monitoring is especially important during widespread failures. This talk explains the journey of hardening Meta’s monitoring systems to be among our most resilient infrastructure – available when most other systems are degraded (and when we most need monitoring). This session will touch on both engineering culture and the technical strategies (e.g., workload isolation, graceful degradation) and cultural/process strategies (e.g. meetings, tracking) that we leveraged to improve the resiliency of monitoring systems at Meta. Outline Overview of Monitoring @ Meta When Monitoring Fails: What we learned from the Facebook 2021 outage Building Culture: Resiliency as a core value Building Resilient Systems: Know thy enemy Iterative Improvement: Measuring resilience

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy