Turbocharging AI/ML Workloads: Revving Up Speed and Resilience

Race cars are built for speed and resilience, equipped with cutting-edge features to reach high velocities while maintaining a firm grip on the perilous track. What if we could apply similar features to boost the speed and resilience of AI/ML jobs running over complex networking fabrics?
In this session, we’ll dive into the key networking challenges impacting AI/ML workloads, such as NIC and link flapping, network contention, and congestion. These issues not only slow down job completion times but also eat into ROI by increasing the likelihood of interruptions and costly rollbacks. I’ll demonstrate these challenges in action and introduce a solution that enhances network visibility for AI/ML jobs while ensuring smooth, uninterrupted performance even in the face of link instability or congested paths. By addressing these issues, we can optimize the efficiency of AI/ML jobs, reduce time lost to disruptions, and improve ROI by avoiding the need to revert to previous checkpoints.
