Application-specific hardware platforms play a crucial role in meeting the growing latency and compute demands of workloads like deep learning, content understanding, and video encoding. However, operating these heterogeneous platforms efficiently at scale is challenging. In this talk we introduce asicmon, an observability framework for accelerators. Asicmon presents upstream monitoring software with a simple abstraction of the accelerator. It also eases development through a custom-built specification language, Asimov. With Asimov, we were able to prototype and onboard new accelerators quickly, reducing onboarding time from months to weeks.
Beyond monitoring, tracing also plays a key part in understanding performance and the interaction between the CPU and the accelerator. We developed a tracing framework, atrace, to collect and process traces at scale. Atrace provides key insights such as operator profiles and critical path analysis. We also extended the native tracing capabilities by correlating events with the CPU in the open-source Glow and PyTorch software stacks. Doing so enabled engineers to close a performance gap of up to 10% between PyTorch and Caffe2 implementations of AI models.