Meta’s monitoring infrastructure tracks the health of thousands of systems deployed on millions of heterogeneous, geographically distributed hosts. That visibility is crucial to both our users and our business, and it matters most during widespread failures. This talk explains the journey of hardening Meta’s monitoring systems to be among our most resilient infrastructure, available when most other systems are degraded (and when we most need monitoring). The session covers both the technical strategies (e.g., workload isolation, graceful degradation) and the cultural and process strategies (e.g., meetings, tracking) that we used to improve the resiliency of monitoring systems at Meta.

Outline
- Overview of Monitoring @ Meta
- When Monitoring Fails: What we learned from the Facebook 2021 outage
- Building Culture: Resiliency as a core value
- Building Resilient Systems: Know thy enemy
- Iterative Improvement: Measuring resilience