Limiting the Blast Radius: Preventing outages from bad configs with datacenter-scale health checks

A bad configuration change can be catastrophic, causing a broad, near-instantaneous system outage. We use datacenter-scale health metrics to validate configuration changes before they deploy to all of production. By adopting this validation step broadly at Meta, we have been able to prevent several major incidents. In addition, we use the signature generated by single-datacenter deployments to quickly identify the root cause of many other incidents. This talk will delve into the technique of datacenter-scale health checks, the successes achieved, and our ongoing work in this area to prevent future incidents.
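
The validation step described above can be illustrated with a minimal sketch. It is not Meta's implementation: the names `HealthSnapshot`, `is_healthy`, and `staged_rollout`, the two metrics, and the thresholds are all hypothetical, chosen only to show the shape of a single-datacenter health gate. The caller supplies the `apply`, `rollback`, and `measure` functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealthSnapshot:
    """Hypothetical health metrics aggregated over one datacenter."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile request latency


def is_healthy(baseline: HealthSnapshot, current: HealthSnapshot,
               max_error_increase: float = 0.01,
               max_latency_ratio: float = 1.5) -> bool:
    """Compare post-change metrics to the pre-change baseline.

    The thresholds are illustrative assumptions, not recommended values.
    """
    error_ok = current.error_rate - baseline.error_rate <= max_error_increase
    latency_ok = current.p99_latency_ms <= baseline.p99_latency_ms * max_latency_ratio
    return error_ok and latency_ok


def staged_rollout(config: str, canary_dc: str, all_dcs: List[str],
                   apply: Callable[[str, str], None],
                   rollback: Callable[[str], None],
                   measure: Callable[[str], HealthSnapshot]) -> bool:
    """Deploy to one datacenter first; continue only if it stays healthy."""
    baseline = measure(canary_dc)      # health before the change
    apply(canary_dc, config)           # blast radius: a single datacenter
    current = measure(canary_dc)       # health after the change
    if not is_healthy(baseline, current):
        rollback(canary_dc)            # the bad config never leaves the canary
        return False
    for dc in all_dcs:
        if dc != canary_dc:
            apply(dc, config)          # safe to proceed to the rest of production
    return True
```

A real system would presumably wait for metrics to settle before measuring and would watch many more signals. The sketch only captures the ordering: measure, change one datacenter, measure again, then either roll back or continue.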

