Bringing Llama 3 to Life – Llama Inference at Meta

Optimizing and scaling LLM inference is crucial for enabling large-scale product applications at reasonable cost. This presentation will introduce key parallelism techniques that help scale model sizes and context windows, which in turn shape inference system designs. Additionally, we will discuss practical challenges of deploying these complex serving paradigms across our internal cloud and data centers with heterogeneous hardware, including the multi-faceted trade-offs required to handle large-scale, dynamic real-world loads.
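For context, here is a minimal sketch of one such parallelism technique, tensor (model) parallelism, in which a layer's weight matrix is sharded column-wise across devices and the partial results are gathered back together. The NumPy example below is illustrative only; the function names, shapes, and single-process simulation are assumptions for exposition and are not drawn from the talk or Meta's serving stack.

```python
import numpy as np

def tensor_parallel_matmul(x, weight_shards):
    """Column-parallel matmul: each simulated 'device' computes x @ W_i on its
    own weight shard; concatenating the partial outputs stands in for the
    all-gather a real multi-GPU system would perform."""
    partial_outputs = [x @ w for w in weight_shards]
    return np.concatenate(partial_outputs, axis=-1)

rng = np.random.default_rng(0)
hidden, out_dim, num_devices = 64, 256, 4  # hypothetical sizes
x = rng.standard_normal((2, hidden))
full_weight = rng.standard_normal((hidden, out_dim))

# Split the weight column-wise into one shard per device.
shards = np.split(full_weight, num_devices, axis=1)

# The sharded computation reproduces the unsharded layer's output.
assert np.allclose(tensor_parallel_matmul(x, shards), x @ full_weight)
```

The same idea extends to attention heads and feed-forward blocks; combined with pipeline and context parallelism, it is what allows model sizes and context windows to grow beyond a single accelerator's memory.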

