JUNE 12, 2024

MEGASCALE: SCALING LARGE LANGUAGE MODEL TRAINING TO MORE THAN 10,000 GPUS

In this presentation, I will discuss the design, implementation, and engineering experience of building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long duration of LLM training jobs. Many hard stability issues emerge only at large scale, and in-depth observability is the key to addressing them. We developed a set of diagnostic tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. We share our operational experience in identifying and fixing failures and stragglers.
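As a concrete illustration of the kind of deep-in-the-stack monitoring described above, the sketch below flags suspected stragglers by comparing each rank's step time against the median across all ranks. This is a minimal example assuming a PyTorch distributed training job; the function name, the 10% slowdown threshold, and the reporting logic are illustrative assumptions, not MegaScale's actual tooling.

```python
# Minimal straggler-detection sketch for a PyTorch distributed job.
# Assumes torch.distributed is already initialized; the 10% threshold
# and all naming here are illustrative, not MegaScale's actual tooling.
import statistics
import torch.distributed as dist

def report_stragglers(step_time_s: float, threshold: float = 1.10) -> None:
    """Gather every rank's last step time and flag outliers on rank 0."""
    world_size = dist.get_world_size()
    times = [None] * world_size
    # Collect each rank's measured step time (a picklable Python float).
    dist.all_gather_object(times, step_time_s)
    if dist.get_rank() == 0:
        median = statistics.median(times)
        # A rank is a suspected straggler if its step runs more than
        # `threshold`x slower than the median across all ranks.
        slow = [r for r, t in enumerate(times) if t > threshold * median]
        if slow:
            print(f"suspected stragglers: ranks {slow} "
                  f"(median step time {median:.3f}s)")
```

Running such a check periodically (say, every few hundred steps) keeps its overhead negligible while surfacing slow ranks before they dominate the job's critical path.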
