August 13, 2022

Software and Hardware Remediations At Meta | Antonio Davoli & Leandro Silva

Topic: Systems and Networking

Antonio Davoli

Leandro Silva

TYPE: Videos

YEAR: 2022

Efficient software and hardware failure remediation is the foundation for sustaining high fleet availability in large-scale environments such as Meta's. In this talk, we will describe the general architecture that we use to maximize fleet availability by identifying and remediating software issues, as well as faulty assets that might require manual intervention from a datacenter technician.

We’ll start by presenting FBAR (Facebook Auto-Remediation), the platform that our main customers (IG, Messenger, etc.) use to write event-based software remediation workflows, which are executed to mitigate software issues. We’ll then turn to what happens when issues are hardware related instead: how we detect those faults at scale, and how we empowered our repair workflow with a decision engine called RepairBrain. RepairBrain helps identify and track which repair actions might speed up the mitigation, and it also provides a platform for deploying custom repair actions.
