Systems @Scale Winter 2021

Software and Hardware Remediations At Meta | Antonio Davoli & Leandro Silva

Efficient remediation of software and hardware failures is the foundation for sustaining high fleet availability in large-scale environments such as Meta. In this talk, we will describe the general architecture we use to maximize fleet availability by identifying and remediating software issues, as well as faulty assets that might require manual intervention from a datacenter technician.
We’ll start by presenting FBAR (Facebook Auto-Remediation), the platform our main customers (Instagram, Messenger, etc.) use to write event-based remediation workflows that mitigate software issues. We’ll then turn to hardware-related issues: how we detect those faults at scale, and how we extended our repair workflow with a decision engine called RepairBrain, which helps identify and track which repair actions might speed up mitigation and provides a platform for deploying custom repair actions.
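
As a rough illustration of the two ideas in this abstract, the sketch below registers a remediation handler against an event type and picks a repair action from past outcomes. Every name in it (`on_event`, `dispatch`, `choose_repair_action`, the event fields, the action names) is hypothetical and is not FBAR's or RepairBrain's actual API. Ranking actions by historical success rate is also an assumption made for the example, not a description of how RepairBrain decides.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical types and names: this is NOT the real FBAR or RepairBrain API.
Event = Dict[str, str]
Handler = Callable[[Event], str]

_handlers: Dict[str, List[Handler]] = defaultdict(list)


def on_event(event_type: str) -> Callable[[Handler], Handler]:
    """Register a remediation handler for one event type."""
    def register(fn: Handler) -> Handler:
        _handlers[event_type].append(fn)
        return fn
    return register


@on_event("service_unresponsive")
def restart_service(event: Event) -> str:
    # A real workflow would act on the host; here we only describe the action.
    return f"restart {event['service']} on {event['host']}"


def dispatch(event: Event) -> List[str]:
    """Run every handler registered for the event's type, in registration order."""
    return [handler(event) for handler in _handlers.get(event["type"], [])]


def choose_repair_action(history: Dict[str, List[bool]]) -> str:
    """Return the action with the highest past success rate (assumed policy)."""
    def success_rate(action: str) -> float:
        outcomes = history[action]
        return sum(outcomes) / len(outcomes) if outcomes else 0.0
    return max(history, key=success_rate)
```

In this sketch, software events flow through `dispatch`, while a hardware fault would be passed to something like `choose_repair_action` together with the recorded outcomes of earlier repairs on similar assets.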