This talk addresses the challenges of, and solutions for, scaling large language model (LLM) inference to support up to 1 billion monthly active users across Meta AI's platforms, focusing on the compute-bound prefill and memory-bound decode stages. Key challenges include the quadratic scaling of attention operations with sequence length, the linear growth of the KV cache, and network-intensive operations that impact latency. To improve scaling efficiency, a multi-dimensional parallelism strategy is applied across hardware platforms, including NVIDIA and AMD. Innovations such as Context Parallelism (CP) and iRoPE enable near-linear prefill scaling, while optimized communication techniques, such as Dynamic/Persistent All-to-All for Expert Parallelism (EP) and Direct Data Access (DDA) for Tensor Parallelism (TP), significantly improve performance. Future efforts aim to further improve system efficiency through fused kernels and device-initiated operations.
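As a rough illustration of why prefill tends to be compute-bound while decode is memory-bound, the sketch below estimates attention FLOPs (quadratic in sequence length) and KV-cache size (linear in sequence length) for a single sequence. This is a back-of-envelope aid, not material from the talk; the model dimensions are hypothetical placeholders rather than Meta AI configuration values.

```python
# Back-of-envelope scaling estimate: attention compute grows quadratically
# with sequence length, while the KV cache grows linearly. All model
# dimensions below are illustrative placeholders, not values from the talk.

def attention_flops(seq_len: int, n_layers: int, n_heads: int, head_dim: int) -> float:
    """Approximate FLOPs for the QK^T and attention-value matmuls during prefill."""
    # Two matmuls of shape (seq_len x head_dim) @ (head_dim x seq_len) per head,
    # each costing ~2 * seq_len^2 * head_dim FLOPs.
    per_layer = 2 * 2 * n_heads * seq_len * seq_len * head_dim
    return n_layers * per_layer

def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence: keys + values per layer, bf16/fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    cfg = dict(n_layers=80, n_heads=64, n_kv_heads=8, head_dim=128)  # hypothetical config
    for seq_len in (8_192, 32_768, 131_072):
        flops = attention_flops(seq_len, cfg["n_layers"], cfg["n_heads"], cfg["head_dim"])
        cache = kv_cache_bytes(seq_len, cfg["n_layers"], cfg["n_kv_heads"], cfg["head_dim"])
        print(f"seq_len={seq_len:>7}: attention ~{flops / 1e12:8.1f} TFLOPs, "
              f"KV cache ~{cache / 2**30:6.2f} GiB")
```

Quadrupling the sequence length multiplies the attention FLOPs by roughly 16x but the KV cache by only 4x, which is why techniques like Context Parallelism target prefill compute while decode optimizations focus on memory bandwidth and cache capacity.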