August 15, 2022

How Meta Keeps its Large-scale Infrastructure Hardware Up and Running

Fred Lin

TYPE: Videos

YEAR: 2022

Internet services like Facebook, Instagram, and WhatsApp rely on large-scale infrastructure to support their compute, storage, and AI workloads. With the support of data and ML techniques, we can scale our infrastructure successfully by improving the efficiency of our tooling and workflows. In this presentation we’ll share our recent work on hardware remediation, automated anomaly detection and root cause analysis, tuning of error-reporting interrupts to minimize performance overhead, near-real-time server reboot detection at scale, and an ML framework for predicting repairs for hardware failures. These data and ML solutions let us involve people less often but with more context, so they can focus on the genuinely challenging work while repetitive tasks are automated.
