Silent Errors in Large-Scale LLM Training: Challenges and Lessons Learned

GPU cluster reliability is a growing challenge as AI models and the clusters that host them grow to unprecedented scale. Insidious errors such as Silent Data Corruptions (SDCs) are particularly difficult to address because they are elusive and non-deterministic, and their effect on large-scale LLM training and inference is poorly understood. In this talk, we will present how NVIDIA is leveraging its expertise in GPUs and AI to tackle this challenge holistically, from silicon to data centers. We will go over our work to better understand these complex errors and their effects in real-world, at-scale AI cluster deployments, and the solutions we are developing to help researchers, cluster builders, and the industry protect against SDCs.
