The network and collective communication stack plays a pivotal role in extracting the best performance from large GenAI clusters. In this talk, we will go in depth on the network and communication-library tuning that helped achieve optimal performance for GenAI models such as LLaMA3. We'll cover optimizations from both the training and model-serving perspectives. We'll dig into how we mitigated the impact of network latency by implementing novel collective algorithms and network routing enhancements, and the steps we took to reduce the impact of compute overlap on communication time. We'll also share our perspective on the challenges that remain in scaling these models further while still achieving optimal compute and network efficiency.