Alibaba HPN: A Data Center Network for Large Language Model Training

Due to differences between LLM training and general cloud computing (e.g., in traffic patterns and fault tolerance), traditional data center networks are not well suited to LLM training. General cloud computing generates millions of small flows (e.g., below 10 Gbps), whereas LLM training produces a small number of periodic, bursty flows (e.g., 400 Gbps) on each host; this traffic pattern makes Equal-Cost Multi-Path (ECMP), the commonly used load-balancing scheme, prone to hash polarization. HPN introduces a 2-tier, dual-plane architecture capable of interconnecting 15K GPUs within one Pod, a scale that traditionally requires a 3-tier Clos architecture. This new architecture not only avoids hash polarization but also greatly reduces the search space for path selection, allowing us to precisely select network paths capable of carrying elephant flows. HPN also employs a dual-ToR design to avoid the single point of failure of a lone top-of-rack switch. We share our experience in motivating, designing, and building HPN, as well as operational lessons from running HPN in production.
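To make the hash-polarization problem concrete, here is a minimal Python sketch (not from the paper) of what happens when two tiers of switches apply the same ECMP hash to the same flow fields: every flow a tier-2 switch receives has, by construction, already hashed to one bucket, so re-hashing sends all of them out a single uplink. The flow tuples, salt values, and two-way split are illustrative assumptions.

```python
import hashlib

def ecmp_pick(flow, num_paths, salt=""):
    """Pick an ECMP next hop by hashing the flow 5-tuple (plus an
    optional per-switch salt) into one of num_paths buckets."""
    digest = hashlib.md5((salt + str(flow)).encode()).hexdigest()
    return int(digest, 16) % num_paths

# Hypothetical flows: (src IP, dst IP, src port, dst port, proto).
flows = [("10.0.0.%d" % (i % 256), "10.1.0.%d" % (i * 7 % 256),
          5000 + i, 443, "tcp") for i in range(1000)]

# Tier 1 splits flows across 2 uplinks. Flows with bucket 0 arrive at
# tier-2 switch B, which re-hashes the SAME fields with the SAME
# function -- so every surviving flow lands in bucket 0 again.
arriving = [f for f in flows if ecmp_pick(f, 2) == 0]
polarized = {ecmp_pick(f, 2) for f in arriving}
print("tier-2 uplinks used (same hash):", len(polarized))      # -> 1

# A per-switch salt decorrelates the two hash stages, restoring the
# even split across both uplinks.
salted = {ecmp_pick(f, 2, salt="switch-B") for f in arriving}
print("tier-2 uplinks used (salted hash):", len(salted))       # -> 2
```

Salting mitigates polarization but still spreads flows probabilistically; with only a handful of 400 Gbps elephant flows per host, hash collisions remain costly, which is why HPN instead shrinks the path search space and selects paths deterministically.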
