MAY 22, 2024

A CASE STUDY IN BRIDGING PRODUCTION SOFTWARE AND DATA PRACTICES FOR LLM MODEL TRAINING USING SNOWFLAKE

We present a case study detailing the use of Snowflake, a cloud-based data platform, across the stages of the LLM data pipeline, from initial annotation to model productionization. The system we have built brings production software and data practices to the field of LLM model training. We describe how every step in the system is built, including data annotation, filtering, global deduplication, decontamination, and tokenization. We show how the data engineering capabilities of a cloud warehouse like Snowflake can be used to enhance data exploration, LLM data ablations, and experimentation. A key aspect of LLM productionization that we cover is data lineage tracking, facilitated by output cards at each stage of the data pipeline, which ensures transparency and traceability throughout the LLM development lifecycle.
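For concreteness, the sketch below illustrates how one stage of such a pipeline (exact-match global deduplication) might be expressed against Snowflake from Python. The table and column names (RAW_DOCUMENTS, DEDUPED_DOCUMENTS, DOC_ID, TEXT), the hash-based strategy, and the connection placeholders are illustrative assumptions, not details taken from the system described in the case study.

```python
# Sketch only: table/column names and the dedup strategy are hypothetical,
# not taken from the case study.
import snowflake.connector

# Placeholder connection parameters; supply real credentials through your
# preferred secrets mechanism.
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

# Exact-match global deduplication: hash each document's text and keep one
# row per hash using Snowflake's QUALIFY + ROW_NUMBER() window filter.
DEDUP_SQL = """
CREATE OR REPLACE TABLE DEDUPED_DOCUMENTS AS
SELECT DOC_ID, TEXT
FROM RAW_DOCUMENTS
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY MD5(TEXT)
    ORDER BY DOC_ID
) = 1
"""

cur = conn.cursor()
try:
    cur.execute(DEDUP_SQL)

    # Record before/after row counts, e.g. as input to an output card
    # summarizing this pipeline stage.
    cur.execute("SELECT COUNT(*) FROM RAW_DOCUMENTS")
    n_before = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM DEDUPED_DOCUMENTS")
    n_after = cur.fetchone()[0]
    print(f"deduplication kept {n_after} of {n_before} documents")
finally:
    cur.close()
    conn.close()
```

Pushing the deduplication into a single warehouse query, rather than pulling documents into application code, is one way the data engineering capabilities of a platform like Snowflake can simplify these pipeline stages; fuzzy deduplication and decontamination would require additional logic beyond this sketch.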
