Systems @Scale Summer 2021
Platform Agnostic Observability System for AI Accelerators

Application-specific hardware platforms play a crucial role in meeting the growing latency and compute demands of workloads like deep learning, content understanding, and video encoding. However, it is challenging to operate these heterogeneous platforms efficiently at scale. In this talk we introduce asicmon, an observability framework for accelerators. Asicmon presents the accelerator to upstream monitoring software through a simple abstraction. It also eases development through a custom-built specification language, Asimov. With Asimov we could prototype and onboard new accelerators quickly, reducing the onboarding time from months to weeks.

Beyond monitoring, tracing also plays a key part in understanding performance and the interaction between the CPU and the accelerator. We developed a tracing framework, atrace, to collect and process traces at scale. Atrace provides key insights such as operator profiles and critical path analysis. We also extended the native tracing capabilities by correlating events to the CPU in the open-source Glow and PyTorch software stack. Doing so enabled engineers to close a performance gap of up to 10% between PyTorch and Caffe2 implementations of AI models.
