Race cars are built for speed and resilience, equipped with cutting-edge features to reach high velocities while maintaining a firm grip on the perilous track. What if we could apply similar features to boost the speed and resilience of AI/ML jobs running over complex networking fabrics?
In this session, we’ll dive into the key networking challenges impacting AI/ML workloads, such as NIC and link flapping, network contention, and congestion. These issues not only lengthen job completion times but also eat into ROI by increasing the likelihood of interruptions and costly rollbacks. I’ll demonstrate these challenges in action and introduce a solution that enhances network visibility for AI/ML jobs while keeping them running smoothly in the face of link instability or congested paths. Addressing these issues makes AI/ML jobs more efficient, reduces time lost to disruptions, and improves ROI by avoiding the need to revert to previous checkpoints.