JUNE 08, 2022

How Meta Keeps its Large-scale Infrastructure Hardware Up and Running

Fred Lin

TOPIC: Data, Systems and Networking

@SCALE SERIES: Systems @Scale

TYPE: video

YEAR: 2022

Internet services like Facebook, Instagram, and WhatsApp rely on large-scale infrastructure to support their compute, storage, and AI workloads. With the help of data and ML techniques, we can scale this infrastructure successfully by improving the efficiency of our tooling and workflows. In this presentation, we’ll share our recent work on hardware remediation, automated anomaly detection and root cause analysis, tuning of error-reporting interrupts to minimize performance overhead, near-real-time server reboot detection at scale, and an ML framework for predicting repairs for hardware failures. These data and ML solutions let us involve people less often but with more context, so that repetitive tasks are automated and people can focus on the genuinely challenging work.
