How Meta Keeps its Large-scale Infrastructure Hardware Up and Running
Internet services like Facebook, Instagram, and WhatsApp rely on large-scale infrastructure to support a variety of compute, storage, and AI workloads. With the support of data and ML techniques, we can scale this infrastructure successfully by improving the efficiency of our tooling and workflows. In this presentation, we’ll share our recent work on hardware remediation, automated anomaly detection and root cause analysis, tuning of error-reporting interrupts to minimize performance overhead, near-real-time server reboot detection at scale, and an ML framework for predicting repairs for hardware failures. These data and ML solutions let us involve people less often but with more context, so that people can focus on the genuinely challenging work while repetitive tasks are automated.