Scheduler and Sharding Considerations for Network Efficiency

Every generation of large language models requires an order of magnitude more compute and networking infrastructure than the last. Various model and data parallelism techniques distribute the model’s computational load across that infrastructure, and achieving optimal network communication performance requires careful consideration of how models map onto the network architecture and hierarchy. One of the important layers of the training stack that influences this mapping is the job scheduler. This talk explores how we customized Meta’s job scheduler (MAST) to optimally map parallelism onto network topology. We will cover how we represented our network hierarchy so the scheduler can consume it, as well as the specific customizations we made to ensure that high-bandwidth, latency-sensitive communications map to the first few network hops. We will present a recent use case, Llama3, to demonstrate the parallelisms used during training and the impact the scheduler modifications have had on their performance and network efficiency.
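To make the idea of mapping parallelism onto the network hierarchy concrete, here is a minimal, hypothetical sketch of topology-aware rank placement. It is not MAST’s implementation; all names (`Host`, `assign_ranks`, the zone/pod/rack fields, and the parallelism sizes) are illustrative assumptions. The point it demonstrates is that if hosts are ordered by their position in the network hierarchy and the most communication-intensive parallelism dimension (here, tensor parallelism) varies fastest across ranks, then the heaviest traffic stays within the fewest network hops.

```python
# Hypothetical sketch of topology-aware rank placement (not Meta's MAST code).
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    zone: str   # coarsest network domain (most hops to cross)
    pod: str    # intermediate domain
    rack: str   # finest domain (fewest hops to cross)


def assign_ranks(hosts: list[Host], tp: int, pp: int, dp: int) -> dict[int, Host]:
    """Map global ranks to hosts so that ranks in the same tensor-parallel
    group land on topologically adjacent hosts."""
    assert len(hosts) >= tp * pp * dp, "not enough hosts for requested parallelism"
    # Sorting by (zone, pod, rack) places topologically close hosts next to
    # each other in the ordering.
    ordered = sorted(hosts, key=lambda h: (h.zone, h.pod, h.rack))
    mapping: dict[int, Host] = {}
    # Rank layout: TP varies fastest, then PP, then DP. Consecutive ranks
    # (same TP group) therefore map to consecutive, locality-sharing hosts.
    for dp_idx in range(dp):
        for pp_idx in range(pp):
            for tp_idx in range(tp):
                rank = dp_idx * pp * tp + pp_idx * tp + tp_idx
                mapping[rank] = ordered[rank]
    return mapping


if __name__ == "__main__":
    # 16 hosts: 2 per rack, 4 per pod, 8 per zone (purely illustrative numbers).
    hosts = [
        Host(f"host{i}", zone=f"z{i // 8}", pod=f"p{i // 4}", rack=f"r{i // 2}")
        for i in range(16)
    ]
    for rank, host in assign_ranks(hosts, tp=2, pp=2, dp=4).items():
        print(rank, host.name, host.rack, host.pod, host.zone)
```

With this layout, each 2-way tensor-parallel group ends up inside a single rack, pipeline-parallel neighbors stay within a pod, and only the less latency-sensitive data-parallel traffic crosses zones, which is the general placement goal the talk describes.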
