Bringing Llama 3 to Life – Llama Inference at Meta

Optimizing and scaling LLM inference is crucial for enabling large-scale product applications at reasonable cost. This presentation will introduce key parallelism techniques that help scale model sizes and context windows, which in turn shape inference system designs. Additionally, we will discuss practical challenges of deploying these complex serving paradigms across our internal cloud and data centers with heterogeneous hardware, including the multi-faceted trade-offs required to handle large-scale, dynamic real-world loads.
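For context, here is a minimal sketch of one such parallelism technique, tensor (model) parallelism, in which a layer's weight matrix is sharded column-wise across devices and the partial results are gathered back together. The NumPy example below is illustrative only; the function names, shapes, and single-process simulation are assumptions for exposition and are not drawn from the talk or Meta's serving stack.

```python
import numpy as np

def tensor_parallel_matmul(x, weight_shards):
    """Column-parallel matmul: each simulated 'device' computes x @ W_i on its
    own weight shard; concatenating the partial outputs stands in for the
    all-gather a real multi-GPU system would perform."""
    partial_outputs = [x @ w for w in weight_shards]
    return np.concatenate(partial_outputs, axis=-1)

rng = np.random.default_rng(0)
hidden, out_dim, num_devices = 64, 256, 4  # hypothetical sizes
x = rng.standard_normal((2, hidden))
full_weight = rng.standard_normal((hidden, out_dim))

# Split the weight column-wise into one shard per device.
shards = np.split(full_weight, num_devices, axis=1)

# The sharded computation reproduces the unsharded layer's output.
assert np.allclose(tensor_parallel_matmul(x, shards), x @ full_weight)
```

The same idea extends to attention heads and feed-forward blocks; combined with pipeline and context parallelism, it is what allows model sizes and context windows to grow beyond a single accelerator's memory.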

