Experience Operating Large GPU Clusters at Organizational Scale

We outline Nvidia’s experience managing a large-scale internal GPU compute platform spanning multiple heterogeneous clusters. The platform supports thousands of users and hundreds of project accounts, handling a diverse mix of training and batch inference workloads across various research fields. We focus on three key challenges: researcher productivity, resource utilization, and operational efficiency. To improve researcher productivity, we emphasize fair scheduling and workload resilience. To improve resource utilization, we discuss strategies for keeping cluster occupancy high. On the operational efficiency front, we highlight our scheduler simulation capabilities, which enable safe testing of changes without affecting production workloads. The presentation concludes with key lessons learned and our vision for future improvements.
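
To give a flavor of the fair-scheduling problem discussed above, the sketch below ranks pending jobs by how far each project account's recent GPU usage falls below its weighted fair share. This is a minimal illustration only, not Nvidia's actual scheduler; the account names, weights, and the `fair_share_priority` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Account:
    name: str
    weight: float          # share of the cluster this project is entitled to
    gpu_hours_used: float  # recent GPU-hours consumed by this account

@dataclass
class Job:
    account: str
    gpus: int

def fair_share_priority(account: Account, total_weight: float, total_usage: float) -> float:
    """Higher priority for accounts consuming less than their weighted fair share."""
    entitled = account.weight / total_weight                     # fraction of cluster owed
    consumed = account.gpu_hours_used / max(total_usage, 1e-9)   # fraction actually used
    return entitled - consumed                                   # positive => under-served

def order_pending_jobs(jobs: list[Job], accounts: dict[str, Account]) -> list[Job]:
    """Sort the pending queue so the most under-served accounts run first."""
    total_weight = sum(a.weight for a in accounts.values())
    total_usage = sum(a.gpu_hours_used for a in accounts.values())
    return sorted(
        jobs,
        key=lambda j: fair_share_priority(accounts[j.account], total_weight, total_usage),
        reverse=True,
    )

# Example: "vision" has consumed far more of its share than "speech",
# so speech's job is scheduled ahead of vision's.
accounts = {
    "vision": Account("vision", weight=2.0, gpu_hours_used=900.0),
    "speech": Account("speech", weight=1.0, gpu_hours_used=50.0),
}
pending = [Job("vision", gpus=8), Job("speech", gpus=8)]
print([j.account for j in order_pending_jobs(pending, accounts)])  # ['speech', 'vision']
```

A policy of this shape also lends itself to the scheduler simulation mentioned above: the same ordering function can be replayed against recorded job traces offline before any change reaches production.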
