Training data for Meta’s recommendation systems was entirely stored in Data Warehouse, structured as relational tables where each row captures labels and snapshotted features at the point of recommendation.
New modeling techniques, such as learning from user sequences and multi-modality, has led to a 10-100x increase in feature size, making the training data increasingly cost-prohibitive due to high duplication. The same user’s features are stored repeatedly for every recommendation request, with highly popular content features being duplicated potentially over a million times.
We present a co-designed data and infrastructure in order to address the scaling challenge. By moving features out of training samples into a high-performance indexing storage and implementing model access pattern-aware pushdown optimizations, we have achieved a 10x storage cost reduction for the largest feature: long user sequences.