AI OBSERVABILITY AT META SCALE

AI training and inference make up a large portion of Meta’s infrastructure. Running AI workloads requires fast, expensive compute hardware along with powerful networking systems. This poses new challenges for our observability system and also opens up significant opportunities. In this talk, we present scalable observability infrastructure and tools that enable building faster and more efficient AI software, and show how we leverage this data for predictive analysis of job efficiency.
