Inference Deployments and Comms Implications

This talk addresses the challenges and solutions involved in scaling large language model (LLM) inference for Meta AI to support up to 1 billion monthly active users across Meta's platforms, focusing on the compute-bound prefill and memory-bound decode stages. Key challenges include the quadratic scaling of attention operations with sequence length, the linear growth of the KV cache, and network-intensive collective operations that impact latency. To improve scaling efficiency, the talk proposes a multi-dimensional parallelism strategy spanning hardware platforms from both Nvidia and AMD. Innovations such as Context Parallelism (CP) and iRoPE enable near-linear prefill scaling, while optimized communication techniques, including Dynamic/Persistent All-to-All for Expert Parallelism (EP) and Direct Data Access (DDA) for Tensor Parallelism (TP), deliver significant performance gains. Future work aims to further improve system efficiency through fused kernels and device-initiated communication operations.
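To make the prefill/decode distinction concrete, here is a minimal back-of-envelope sketch (not from the talk) of why attention cost grows quadratically with sequence length while the KV cache grows linearly. The model shape below (layer count, head counts, head dimension, dtype size) is an illustrative assumption, roughly Llama-70B-class, not a figure quoted by the speakers.

```python
# Rough cost model: prefill attention FLOPs scale ~O(S^2), KV cache bytes ~O(S).
# All model dimensions here are assumed for illustration only.

def attention_prefill_flops(seq_len, n_layers=80, n_heads=64, head_dim=128):
    # Q @ K^T and P @ V each cost ~2 * S^2 * head_dim FLOPs per head per layer,
    # so total attention work during prefill grows quadratically in seq_len.
    per_head = 2 * 2 * seq_len * seq_len * head_dim
    return per_head * n_heads * n_layers

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V are each (seq_len, n_kv_heads, head_dim) per layer; the cache
    # grows linearly in seq_len and is re-read on every decode step, which is
    # why decode tends to be memory-bandwidth bound.
    return 2 * seq_len * n_kv_heads * head_dim * dtype_bytes * n_layers

for s in (8_192, 32_768, 131_072):
    print(f"S={s:>7}: attention ~{attention_prefill_flops(s)/1e12:9.1f} TFLOPs, "
          f"KV cache ~{kv_cache_bytes(s)/2**30:6.1f} GiB")
```

Under these assumed dimensions, quadrupling the sequence length multiplies prefill attention work by roughly 16x but KV-cache memory by only 4x, which motivates sharding long-context prefill across devices (e.g., Context Parallelism) while optimizing decode for memory traffic and communication latency.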
