JUNE 12, 2024

TRAINING ARCTIC AT SNOWFLAKE

In this case study, we present the system used to train the Arctic MoE model at Snowflake. The system uses a combination of Snowflake and Kubernetes for the entire lifecycle of Large Language Model (LLM) training, ranging from the initial stages of data acquisition and processing (including annotation, filtering, and deduplication) to conducting data ablation experiments and executing large-scale model training. Our approach leverages Snowflake for its robust data governance, lineage tracking, and cloud warehouse capabilities, alongside the versatile CPU and GPU compute resources orchestrated through Kubernetes. This symbiosis not only streamlines the model development process but also enhances efficiency and scalability by optimizing resource allocation and utilization: a cluster of GPU nodes and a Snowflake instance are all you need to train a model from scratch. Through this unified framework, we demonstrate a seamless, end-to-end solution that accelerates LLM training workflows, ensuring both high performance and adherence to data governance standards.
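The case study itself does not include code, but the two halves of the pipeline it describes can be sketched with the standard Python clients for Snowflake and Kubernetes. The sketch below is illustrative only: the table, column, namespace, job, and image names are invented, and credentials would come from a secrets manager in practice. It shows deduplication pushed down into the warehouse as SQL, followed by submission of a GPU training job to the cluster.

```python
"""Hypothetical sketch: pull filtered, deduplicated training data out of
Snowflake, then launch a GPU training job on Kubernetes. All table, job,
and image names are assumptions for illustration."""
import snowflake.connector
from kubernetes import client, config

# --- Data side: filter and deduplicate inside Snowflake. ---
# QUALIFY with ROW_NUMBER() keeps one row per content hash, so exact
# duplicates never leave the warehouse and lineage stays in Snowflake.
conn = snowflake.connector.connect(
    account="my_account",        # assumption: real credentials come from a vault
    user="trainer",
    password="...",
    warehouse="TRAINING_WH",     # hypothetical warehouse name
)
cur = conn.cursor()
cur.execute("""
    SELECT doc_id, text
    FROM raw_corpus                                -- hypothetical table
    WHERE quality_score > 0.5                      -- hypothetical filter column
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY content_hash ORDER BY doc_id  -- hypothetical hash column
    ) = 1
""")

# --- Compute side: launch a GPU training job on the Kubernetes cluster. ---
config.load_kube_config()  # or load_incluster_config() when run in-cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="arctic-ablation-run"),  # hypothetical
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="trainer",
                    image="my-registry/llm-trainer:latest",  # hypothetical image
                    command=["python", "train.py", "--config", "ablation.yaml"],
                    resources=client.V1ResourceRequirements(
                        # Request a full 8-GPU node per pod for the training run.
                        limits={"nvidia.com/gpu": "8"},
                    ),
                )],
            )
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="training", body=job)
```

The same pattern covers both the small data-ablation experiments and the large-scale run: only the job spec (replica count, image, config) changes, while the governed data path through Snowflake stays identical.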

Live remarks will be presented by Jeff Rasley and Lawrence Moore. The post-event on-demand video will feature Jeff Rasley and Hyungtae Kim.
