ML Monitoring & Observability @Meta Scale
ML generates significant value for Meta’s infrastructure, tools, products, and users. It drives a varied set of insights, from end-user products such as recommendations and feeds on Facebook and Instagram to infrastructure insights for demand prediction and capacity planning. However, problems such as gradient explosions, data corruption, feature coverage issues, and multi-layer performance degradations affect the ML ecosystem. As features, data, and models scale, it becomes harder to assess the impact of these problems, find their root causes, and mitigate them, especially when tools, teams, and metadata are siloed and run books are fragmented and manual across the ML lifecycle. In this talk, we provide an overview of ML challenges at Meta and our take on the ML monitoring and observability infrastructure and tooling needed to address these problems. We cover our platform, use cases, and product experiences.