June 30, 2025

Silent Errors in Large-Scale LLM training: Challenges and Lessons Learned

Topic: Data, Machine Learning and AI

Devin O’Kelly

NVIDIA

Cyril Meurillon

NVIDIA

TYPE: Videos

YEAR: 2025

GPU cluster reliability is a growing challenge as AI models and the clusters that host them grow to unprecedented scale. Insidious errors such as Silent Data Corruptions (SDCs) are particularly difficult to address because they are elusive and non-deterministic, and their effect on large-scale LLM training and inference is poorly understood. In this talk, we will present how NVIDIA is leveraging its deep expertise in GPUs and AI to tackle this challenge holistically, from silicon to data centers. We will go over the work we are doing to improve our understanding of these complex errors and their effect in real-world, at-scale AI cluster deployments, and the solutions we are developing to help researchers, cluster builders, and the industry protect against SDCs.
