May 12, 2025

AI Hardware Reliability at Scale

Topic: Systems and Networking

Sriram Sankar

Harish Dixit

TYPE: Videos

YEAR: 2025

This talk will describe our journey with AI hardware reliability (GPUs and silicon) while running large-scale training and inference at Meta. It will highlight our efforts across the ecosystem, covering both vendor systems and our own custom silicon, to run AI hardware reliably at scale. For a software and services audience, it will provide an under-the-hood look at how AI hardware reliability impacts AI applications and how Meta is driving the industry.
