JUNE 12, 2024


In this case study, we present the system used to train the Arctic MoE model at Snowflake. The system combines Snowflake and Kubernetes to cover the entire lifecycle of large language model (LLM) training: data acquisition and processing (including annotation, filtering, and deduplication), data ablation experiments, and large-scale model training. Our approach relies on Snowflake for data governance, lineage tracking, and cloud warehouse capabilities, and on Kubernetes to orchestrate CPU and GPU compute resources. This combination streamlines model development and improves efficiency and scalability by optimizing resource allocation and utilization: a cluster of GPU nodes and a Snowflake instance are all you need to train a model from scratch. Through this unified framework, we demonstrate an end-to-end solution that accelerates LLM training workflows while maintaining high performance and adherence to data governance standards.

Live remarks will be presented by Jeff Rasley and Lawrence Moore. The post-event video on demand will feature Jeff Rasley and Hyungtae Kim.
