Inference Deployments and Comms Implications

This talk addresses the challenges and solutions involved in scaling large language model (LLM) inference for Meta AI to support up to 1 billion monthly active users across Meta's platforms, focusing on the compute-bound prefill and memory-bound decode stages. Key challenges include the quadratic scaling of attention operations with sequence length, the linear growth of the KV cache, and network-intensive collective operations that impact latency. To improve scaling efficiency, the talk proposes a multi-dimensional parallelism strategy spanning hardware platforms from both Nvidia and AMD. Innovations such as Context Parallelism (CP) and iRoPE enable near-linear prefill scaling, while optimized communication techniques, including Dynamic/Persistent All-to-All for Expert Parallelism (EP) and Direct Data Access (DDA) for Tensor Parallelism (TP), deliver significant performance gains. Future work aims to further improve system efficiency through fused kernels and device-initiated communication operations.
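To make the prefill/decode distinction concrete, here is a minimal back-of-envelope sketch (not from the talk) of why attention cost grows quadratically with sequence length while the KV cache grows linearly. The model shape below (layer count, head counts, head dimension, dtype size) is an illustrative assumption, roughly Llama-70B-class, not a figure quoted by the speakers.

```python
# Rough cost model: prefill attention FLOPs scale ~O(S^2), KV cache bytes ~O(S).
# All model dimensions here are assumed for illustration only.

def attention_prefill_flops(seq_len, n_layers=80, n_heads=64, head_dim=128):
    # Q @ K^T and P @ V each cost ~2 * S^2 * head_dim FLOPs per head per layer,
    # so total attention work during prefill grows quadratically in seq_len.
    per_head = 2 * 2 * seq_len * seq_len * head_dim
    return per_head * n_heads * n_layers

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V are each (seq_len, n_kv_heads, head_dim) per layer; the cache
    # grows linearly in seq_len and is re-read on every decode step, which is
    # why decode tends to be memory-bandwidth bound.
    return 2 * seq_len * n_kv_heads * head_dim * dtype_bytes * n_layers

for s in (8_192, 32_768, 131_072):
    print(f"S={s:>7}: attention ~{attention_prefill_flops(s)/1e12:9.1f} TFLOPs, "
          f"KV cache ~{kv_cache_bytes(s)/2**30:6.1f} GiB")
```

Under these assumed dimensions, quadrupling the sequence length multiplies prefill attention work by roughly 16x but KV-cache memory by only 4x, which motivates sharding long-context prefill across devices (e.g., Context Parallelism) while optimizing decode for memory traffic and communication latency.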
