Alibaba HPN: A Data Center Network for Large Language Model Training

Due to differences between LLM training and general cloud computing (e.g., in traffic patterns and fault tolerance), traditional data center networks are not well suited to LLM training. General cloud computing generates millions of small flows (e.g., below 10 Gbps), whereas LLM training produces a small number of periodic, bursty flows (e.g., 400 Gbps) on each host; this traffic pattern makes Equal-Cost Multi-Path (ECMP), the commonly used load-balancing scheme, prone to hash polarization. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, a scale that traditionally requires a 3-tier Clos architecture. This new architecture not only avoids hash polarization but also greatly reduces the search space for path selection, allowing us to precisely select network paths capable of carrying elephant flows. HPN also employs a dual-ToR design to avoid the single point of failure of a lone top-of-rack switch. We share our experience in motivating, designing, and building HPN, as well as operational lessons from running HPN in production.
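To make the hash-polarization problem concrete, here is a minimal Python sketch (not from the paper) of what happens when two tiers of switches apply the same ECMP hash to the same flow fields: every flow a tier-2 switch receives has, by construction, already hashed to one bucket, so re-hashing sends all of them out a single uplink. The flow tuples, salt values, and two-way split are illustrative assumptions.

```python
import hashlib

def ecmp_pick(flow, num_paths, salt=""):
    """Pick an ECMP next hop by hashing the flow 5-tuple (plus an
    optional per-switch salt) into one of num_paths buckets."""
    digest = hashlib.md5((salt + str(flow)).encode()).hexdigest()
    return int(digest, 16) % num_paths

# Hypothetical flows: (src IP, dst IP, src port, dst port, proto).
flows = [("10.0.0.%d" % (i % 256), "10.1.0.%d" % (i * 7 % 256),
          5000 + i, 443, "tcp") for i in range(1000)]

# Tier 1 splits flows across 2 uplinks. Flows with bucket 0 arrive at
# tier-2 switch B, which re-hashes the SAME fields with the SAME
# function -- so every surviving flow lands in bucket 0 again.
arriving = [f for f in flows if ecmp_pick(f, 2) == 0]
polarized = {ecmp_pick(f, 2) for f in arriving}
print("tier-2 uplinks used (same hash):", len(polarized))      # -> 1

# A per-switch salt decorrelates the two hash stages, restoring the
# even split across both uplinks.
salted = {ecmp_pick(f, 2, salt="switch-B") for f in arriving}
print("tier-2 uplinks used (salted hash):", len(salted))       # -> 2
```

Salting mitigates polarization but still spreads flows probabilistically; with only a handful of 400 Gbps elephant flows per host, hash collisions remain costly, which is why HPN instead shrinks the path search space and selects paths deterministically.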
