The advent of open-source large language models such as Llama and Mixtral demands deployment strategies that balance performance and cost-effectiveness. We will first explore adaptive workload management, which helps the serving infrastructure scale with varying demand. Next, we will delve into LLM caching techniques, including sticky routing and prompt caching, which reduce response times and improve system utilization. Finally, we will discuss strategies for relieving system pressure during traffic spikes. Together, these techniques improve the scalability and cost-efficiency of AI platforms in the era of advanced LLMs.