SEPTEMBER 07, 2023

Arcadia: End-to-end AI System Performance Simulator

This presentation introduces Arcadia, a unified system designed to simulate the compute, memory, and network performance of AI training clusters. By providing a multidisciplinary performance analysis framework, Arcadia aims to facilitate design and optimization at multiple system levels, including application, network, and hardware. It enables researchers and practitioners to gain insight into how future AI models and workloads will perform on specific infrastructures, supporting data-driven decisions about the evolution of models and hardware. Arcadia can also simulate the performance impact of scheduled operational tasks on AI models running in production, helping engineers make job-aware decisions during day-to-day operations. Attendees will learn about Arcadia's capabilities and its potential impact on advancing AI systems and infrastructure.
