Scheduling and Sharding Considerations for Network Efficiency

Designed for engineers who build and manage large-scale networks. Networking solutions are critical for building applications and services that serve billions of people around the world, and building and operating networks at this scale presents complex engineering challenges.

As Meta continues to rapidly evolve AI models for our products and for the community, including our pursuit of AGI, how we scale model parameter counts, the amount of training data, and the amount of computation is critical. To scale models, we need to use an enormous amount of computational resources efficiently.

To satisfy these scaling requirements, Meta is building training clusters with efficient network connections. Last year we built a 24K GPU cluster interconnected via RoCE (RDMA over Converged Ethernet) to train Meta’s large-scale models. Since large language model (LLM) pre-training can take months on many thousands of GPUs, efficient utilization of GPUs is critical. We trained the Llama 3 models on 16K GPUs over a period of months; at that scale, an efficiency improvement of just a few percentage points can save us months of training time, which translates into huge resource savings. By taking both the training model and the network into consideration, our highly customized parallelism implementation achieved almost 3x the teraflops per GPU of Llama 2.

For Llama 3 405B model pretraining, holding the model and activations requires more than 700 terabytes (TB) of memory. The model is trained on H100 GPUs with 80 GB of memory each, so to fit the model on the available GPUs we need to shard both the model and the activations across many GPUs. With more GPUs, communication overhead increases. To still achieve strong scaling of GPU utilization, it is critical to choose a sharding strategy that reduces communication overhead.
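As a rough sanity check of why sharding is unavoidable here, the snippet below divides the quoted memory footprint by a single H100’s capacity. The numbers are illustrative round figures, not an exact accounting of Llama 3’s memory usage.

import math

# Back-of-the-envelope check of the memory argument above (illustrative numbers).
total_memory_bytes = 700e12      # >700 TB of model + activation state
hbm_per_gpu_bytes = 80e9         # H100 memory capacity
min_gpus = math.ceil(total_memory_bytes / hbm_per_gpu_bytes)
print(f"~{min_gpus} GPUs just to hold the state")   # ~8750, before any headroom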

Let’s briefly go through a few parallelism options we used in Llama training and their implications for communication. There are two types of parallelism: data parallelism and model parallelism. Data parallelism means each group holds the same model parameters but trains on different data. Model parallelism means each group holds different model parameters but trains on the same data.

Suppose we have a transformer-based model; each transformer layer consists of several matrix multiplications of the form Y = XW, where X is the input matrix, W is the weight matrix, and Y is the output matrix.

Figure 1. Matrix multiplication example: Suppose X has the shape [m, k], and W has the shape [k, n]; then Y will have the shape [m, n]
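For concreteness, here is the shape bookkeeping from Figure 1 as a minimal PyTorch snippet; the sizes are arbitrary.

import torch

# Shape check of the Y = XW example above (illustrative sizes).
m, k, n = 4, 8, 16
X = torch.randn(m, k)            # input  [m, k]
W = torch.randn(k, n)            # weight [k, n]
Y = X @ W                        # output [m, n]
print(Y.shape)                   # torch.Size([4, 16])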

Figure 2. Fully sharded data parallel (FSDP) will shard both weight W and input X. W will be allgathered during computation and resharded after computation.

Fully sharded data parallel (FSDP) is a type of data parallelism. Suppose we have two GPUs for distributed training. We break the weight matrix W into W1 and W2 and place them on the two GPUs, respectively. Before the matrix multiplication, we first allgather W on both GPUs so that each has the complete W. After the matrix multiplication, we reshard W across the two GPUs. Since we can prefetch W, the allgather communication for the next W can easily be hidden by the current computation, as shown in Figure 3.

Figure 3. Communication of W for layer i+1 can be overlapped with computation of layer i through prefetching.
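Below is a minimal sketch, not Meta’s implementation, of the FSDP-style flow described above: each rank holds one shard of W, allgathers the full W just before the matmul, and then drops the unsharded copy. It assumes torch.distributed has been initialized with one GPU per rank and, for illustration only, that W is sharded along its output dimension.

import torch
import torch.distributed as dist

def fsdp_style_matmul(x: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """x: this rank's local batch [m, k]; w_shard: this rank's [k, n // world_size] slice of W."""
    world_size = dist.get_world_size()
    # Allgather the weight shards so every rank temporarily holds the full W.
    gathered = [torch.empty_like(w_shard) for _ in range(world_size)]
    dist.all_gather(gathered, w_shard)
    w_full = torch.cat(gathered, dim=1)   # [k, n]
    y = x @ w_full                        # [m, n], computed on this rank's own data
    del w_full                            # "reshard": keep only the local shard of W
    return y

In practice, PyTorch’s FullyShardedDataParallel wrapper performs the allgather and reshard for you and supports the prefetching that makes the overlap in Figure 3 possible.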

Tensor parallelism is a type of model parallelism. In tensor parallelism, pictured in Figure 4, we also split W into two parts, W1 and W2, but we keep W sharded throughout the matrix multiplication. We feed the same input data X to both GPUs, and each GPU computes part of the output Y. We then gather the outputs Y1 and Y2 into Y on both GPUs. Since the next computation needs Y as input, the gathering of Y is hard to hide behind computation. There are many such matrix multiplications in each model layer, so the communication volume for tensor parallelism is large.

Figure 4. Tensor parallelism: W is always sharded, but output Y will be communicated. 

Figure 5. Each tensor-parallel computation depends on the output of the previous one, so communication of the output blocks computation.
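For contrast, here is a minimal sketch (again, not Meta’s implementation) of the column-wise tensor-parallel matmul described above: W stays sharded, each rank computes a slice of Y, and the slices must be allgathered before the next layer can run, which is why this communication sits on the critical path.

import torch
import torch.distributed as dist

def tp_matmul(x: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """x: replicated input [m, k]; w_shard: this rank's [k, n // world_size] columns of W."""
    world_size = dist.get_world_size()
    y_shard = x @ w_shard                 # [m, n // world_size], a slice of Y
    # The next matmul needs the full Y, so this allgather blocks further computation.
    gathered = [torch.empty_like(y_shard) for _ in range(world_size)]
    dist.all_gather(gathered, y_shard)
    return torch.cat(gathered, dim=1)     # [m, n] on every rank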

Below in Table 1 is a summary of the commonly used parallelisms and their communication requirements. Tensor parallelism’s (TP’s) communication is hard to hide behind computation and thus has low tolerance for slow networks. Context parallelism (CP) and expert parallelism (EP) are in the middle. Pipeline parallelism (PP) and FSDP can tolerate slower networks. The choice of parallelism depends on the model, the training setting, and the hardware. The goal is to fit models onto the hardware while minimizing communication overhead. For example, when the allgather of weights in FSDP becomes a bottleneck and can no longer be hidden by computation at large scale, we need to use TP to reduce the amount of weight allgathered on each GPU. It is critical that each parallelism makes the best use of the available network bandwidth.

Table 1. Summary of parallelisms and their communication requirements

In Llama 3 training, we ordered parallelisms according to the network topology. (See Figure 6 below.) The order of parallelism dimensions, [TP, CP, PP, DP], is optimized for network communication. The inner parallelism requires the highest network bandwidth and lowest latency, while the outer parallelism may spread across a multi-hop network and should tolerate higher network latency. Therefore, based on their requirements for network bandwidth and latency, we place the parallelism dimensions in the order [TP, CP, PP, DP]. TP is the innermost so that it can use the fastest, lowest-latency interconnect. DP (i.e., FSDP) is the outermost parallelism because it can tolerate longer network latency by asynchronously prefetching sharded model weights and reducing gradients.

Figure 6. Network-aware parallelism in Llama 3
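One way to express the [TP, CP, PP, DP] ordering in code is with PyTorch’s DeviceMesh, sketched below with hypothetical group sizes; this illustrates the ordering idea, not Llama 3’s actual configuration. Because the last mesh dimension varies fastest across global ranks, putting "tp" last keeps each TP group on consecutive ranks, ideally within one server, while "dp" spans the slowest, most distant links.

from torch.distributed.device_mesh import init_device_mesh

# Hypothetical sizes for a 16,384-GPU job: 64 (DP) x 8 (PP) x 4 (CP) x 8 (TP).
# Assumes torch.distributed is launched with a matching world size.
mesh = init_device_mesh(
    "cuda",
    (64, 8, 4, 8),
    mesh_dim_names=("dp", "pp", "cp", "tp"),
)
tp_group = mesh["tp"].get_group()   # consecutive ranks, e.g., the 8 GPUs of one host
dp_group = mesh["dp"].get_group()   # ranks that may sit in different zones or buildings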

For the chosen parallelism order to achieve the best network utilization, it is critical to schedule GPUs correctly. For example, TP has the highest requirement for network bandwidth, so the GPUs in the same TP group should be placed within one server, or at least not across racks. PP and DP can be placed cross-zone. Placing GPUs improperly can result in much longer communication times. Since all GPUs need to sync with each other in each iteration, one slow group will slow down the whole training run. (See Figure 7.)

Figure 7. Synchronization between all GPUs in Llama 3 training. Any slow group will slow down the whole training.

Scheduling Considerations

As machine-learning (ML) training models grow larger and larger, unique scheduling considerations arise with respect to network overhead. Specifically, the round-trip time (RTT) between GPU NICs on an RDMA network increases with the number of network hops needed to communicate between them. For example, GPU NICs in the same rack incur the lowest latency, while communication between GPUs in different buildings incurs a much higher RTT. Figure 8 demonstrates this.

Figure 8. Latency overhead when communicating between GPUs goes up the further away the GPUs are topologically.

The increased RTT at larger scales can reduce the model’s training throughput, resulting in non-trivial performance loss. It thus becomes imperative to reduce the network-hop communication overhead between GPU NICs. To understand how the scheduling system does this, let us consider the concept of rank in ML training.

Introducing Ranks

An ML training job at Meta consists of a number of nodes all using various training techniques to train the model. Examples of training techniques include stochastic gradient descent (SGD), support vector machines (SVM), and mixture of experts (MoE). In addition, as called out above, in Llama 3 we use various parallelism techniques such as tensor parallelism, data parallelism, and pipeline parallelism to optimize the model’s performance. In this context, a rank is simply an identifier/index in the process list for an ML training job, with each process running on a different GPU. Ranks with closer indices communicate more often. We thus want to make sure that consecutive ranks are located topologically close to each other in the GPU RDMA network, minimizing communication overhead.
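As a concrete reference point, each training process typically learns its rank through standard torch.distributed calls; the snippet below is generic boilerplate for illustration, not Meta-specific code.

import torch.distributed as dist

# Standard setup: every process joins the job and learns its index (rank).
dist.init_process_group(backend="nccl")
rank = dist.get_rank()                # this process's position in the process list
world_size = dist.get_world_size()    # total number of processes (one per GPU)
print(f"rank {rank} of {world_size}")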

Rank Assignment 

Rank assignment is the process of assigning ranks to GPU hosts in an ML training job and is illustrated below in Figure 9.

Figure 9. Mapping Hosts To Racks

As called out above, our objective is to ensure that consecutive ranks are located close to each other in the network topology. Figure 10 below shows what happens when rank assignment is not considered and ranks are randomly assigned to hosts in an ML training job.

Figure 10. Too much cross-network communication!

If instead we were to assign Rank 1 to GPU Host 1, Rank 2 to GPU Host 2, Rank 3 to GPU Host 3, and so on, we would have the communication pattern illustrated below in Figure 11. There we can see that cross-zone communication is minimized, resulting in better ML training performance at large scale.

Figure 11. Network hops minimized
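A simple way to picture topology-aware rank assignment is to sort hosts by their position in the network hierarchy and hand out ranks in that order, so consecutive ranks land on adjacent hosts. The sketch below is purely illustrative; the Host fields and helper are assumptions, not Meta’s scheduler.

from dataclasses import dataclass

@dataclass
class Host:
    name: str
    building: str
    zone: str
    rack: str

def assign_ranks(hosts: list[Host]) -> dict[str, int]:
    """Return a host-name -> rank mapping that keeps nearby ranks topologically nearby."""
    ordered = sorted(hosts, key=lambda h: (h.building, h.zone, h.rack, h.name))
    return {host.name: rank for rank, host in enumerate(ordered)}

hosts = [
    Host("gpu-03", "b1", "z2", "r7"),
    Host("gpu-01", "b1", "z1", "r1"),
    Host("gpu-02", "b1", "z1", "r1"),
]
print(assign_ranks(hosts))  # hosts in the same rack receive consecutive ranks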

Additional Topology-Aware Scheduling

As the scale of backend networks becomes larger and larger, rank assignment by itself is not sufficient to ensure optimal performance. Specifically, as called out above, we need to restrict tensor-parallel traffic to be intra-rack and pipeline-parallel traffic to be intra-zone; only data-parallel traffic can be allowed to flow cross-building. ML training is continuously evolving, and rather than expose the various parallelism techniques to the scheduler, one way to enforce these constraints is to restrict the number of nodes allocated per network-affinity scope. Specifically, MAST (the ML training scheduler used at Meta) is not aware of parallelism techniques but is instead given additional topological constraints when scheduling very large jobs. These topological constraints specify multiples of hosts to allocate at every network-affinity scope; for example, allocating in multiples of two hosts per rack and 128 hosts per data center. Once these topological constraints have been satisfied and the ML training job has been allocated, the training application then arranges ranks such that tensor parallelism occurs only intra-rack, pipeline parallelism occurs only intra-zone, and data parallelism is allowed to flow cross-building. Figure 12 below depicts this allocation scheme.

Figure 12. ML training job allocated with multiples of 2 hosts per rack and 128 hosts per DC
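A toy way to express the “multiples per scope” rule is to check a proposed placement against the required multiple at each level of the hierarchy. The data shapes, helper name, and constraint values below are illustrative assumptions, not MAST’s actual interface.

from collections import Counter

def satisfies_affinity_multiples(host_scopes, multiples):
    """host_scopes: host -> {"rack": ..., "datacenter": ...};
    multiples: scope name -> required host multiple at that scope."""
    for scope, multiple in multiples.items():
        counts = Counter(scopes[scope] for scopes in host_scopes.values())
        if any(n % multiple != 0 for n in counts.values()):
            return False
    return True

# Example: 2 hosts per rack and 128 hosts per data center, as in Figure 12.
placement = {f"host{i}": {"rack": f"rack{i // 2}", "datacenter": "dc0"} for i in range(128)}
print(satisfies_affinity_multiples(placement, {"rack": 2, "datacenter": 128}))  # True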

Additional Scheduling Considerations

In addition to network considerations, a few other scheduling considerations arise at very large scale (see Figure 13 below). These include how quickly we can restart a job when a host goes down, and fault tolerance: can parts of a very large model continue to train if other parts are down? Scheduling, fault tolerance, and network overhead intersect in interesting and critical ways, and we can’t afford to think about each of these problems in isolation.

Figure 13. Scheduling considerations, fault tolerance, and network overhead intersect in interesting and critical ways, and we must think about each of these problems in relation to the other two.

Putting It All Together

In this post, we talked about the various parallelism techniques used in ML training for Llama 3. In addition, we highlighted the scheduling considerations involved at very large scale for ML training jobs. Finally, we highlighted some of the additional considerations beyond network overhead, such as restart overhead and fault tolerance. We continue to invest in this direction as we build larger and larger GPU RDMA networks at Meta to meet our AGI goals.
