In this talk we present how we trained a 530B-parameter language model on a DGX SuperPOD with over 3,000 A100 GPUs and a high-speed InfiniBand interconnect, and how we can scale to even larger models. We explore three types of parallelism, data, tensor, and pipeline, and show how they can be composed to achieve maximum efficiency; a minimal sketch of how these dimensions fit together appears below. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs, reaching a per-GPU throughput of 52% of theoretical peak. We discuss challenges we faced when training the 530B Megatron-Turing NLG model and offer practical advice on how to successfully train very large language models.
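The sketch below illustrates the basic bookkeeping behind composing the three parallelism dimensions: the tensor-parallel and pipeline-parallel sizes partition one model replica across GPUs, and the remaining factor of the world size becomes the data-parallel degree. The specific sizes (8 x 64 x 6 = 3072) are illustrative assumptions, not necessarily the exact configuration discussed in the talk.

```python
# Hypothetical sketch: composing tensor-, pipeline-, and data-parallel
# dimensions so that their product covers every GPU in the cluster.
# Tensor parallelism splits individual layers across GPUs (typically kept
# within a node to use NVLink), pipeline parallelism splits the stack of
# layers across nodes, and data parallelism replicates the resulting model
# shards to process more of the global batch in parallel.

def data_parallel_size(world_size: int,
                       tensor_parallel: int,
                       pipeline_parallel: int) -> int:
    """Return the data-parallel degree implied by the other two dimensions."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor*pipeline={model_parallel}")
    return world_size // model_parallel


if __name__ == "__main__":
    # Illustrative numbers only: 8-way tensor and 64-way pipeline
    # parallelism on 3072 GPUs leave 6 data-parallel replicas.
    dp = data_parallel_size(world_size=3072,
                            tensor_parallel=8,
                            pipeline_parallel=64)
    print(f"data-parallel replicas: {dp}")  # -> 6
```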