Scheduler and Sharding Considerations for Network Efficiency

Every generation of large language models requires an order of magnitude more compute and networking infrastructure than the last. Various model and data parallelism techniques distribute the model’s computational load across that infrastructure, and achieving optimal network communication performance requires careful consideration of how models map onto the network architecture and hierarchy. One of the important layers of the training stack that influences this mapping is the job scheduler. This talk explores how we customized Meta’s job scheduler (MAST) to optimally map parallelism onto network topology. We will cover how we represented our network hierarchy so the scheduler can consume it, as well as the specific customizations we made to ensure that high-bandwidth, latency-sensitive communications map to the first few network hops. We will present a recent use case, Llama3, to demonstrate the parallelisms used during training and the impact the scheduler modifications have had on their performance and network efficiency.
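To make the idea of mapping parallelism onto the network hierarchy concrete, here is a minimal, hypothetical sketch of topology-aware rank placement. It is not MAST’s implementation; all names (`Host`, `assign_ranks`, the zone/pod/rack fields, and the parallelism sizes) are illustrative assumptions. The point it demonstrates is that if hosts are ordered by their position in the network hierarchy and the most communication-intensive parallelism dimension (here, tensor parallelism) varies fastest across ranks, then the heaviest traffic stays within the fewest network hops.

```python
# Hypothetical sketch of topology-aware rank placement (not Meta's MAST code).
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    zone: str   # coarsest network domain (most hops to cross)
    pod: str    # intermediate domain
    rack: str   # finest domain (fewest hops to cross)


def assign_ranks(hosts: list[Host], tp: int, pp: int, dp: int) -> dict[int, Host]:
    """Map global ranks to hosts so that ranks in the same tensor-parallel
    group land on topologically adjacent hosts."""
    assert len(hosts) >= tp * pp * dp, "not enough hosts for requested parallelism"
    # Sorting by (zone, pod, rack) places topologically close hosts next to
    # each other in the ordering.
    ordered = sorted(hosts, key=lambda h: (h.zone, h.pod, h.rack))
    mapping: dict[int, Host] = {}
    # Rank layout: TP varies fastest, then PP, then DP. Consecutive ranks
    # (same TP group) therefore map to consecutive, locality-sharing hosts.
    for dp_idx in range(dp):
        for pp_idx in range(pp):
            for tp_idx in range(tp):
                rank = dp_idx * pp * tp + pp_idx * tp + tp_idx
                mapping[rank] = ordered[rank]
    return mapping


if __name__ == "__main__":
    # 16 hosts: 2 per rack, 4 per pod, 8 per zone (purely illustrative numbers).
    hosts = [
        Host(f"host{i}", zone=f"z{i // 8}", pod=f"p{i // 4}", rack=f"r{i // 2}")
        for i in range(16)
    ]
    for rank, host in assign_ranks(hosts, tp=2, pp=2, dp=4).items():
        print(rank, host.name, host.rack, host.pod, host.zone)
```

With this layout, each 2-way tensor-parallel group ends up inside a single rack, pipeline-parallel neighbors stay within a pod, and only the less latency-sensitive data-parallel traffic crosses zones, which is the general placement goal the talk describes.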
