MAY 18, 2022

TorchData and TorchArrow: Data Preprocessing for ML at Production Scale

Wenlei Xie

Vitaly Fedyunin

Yingxin Kang

TOPIC: Data, Systems and Networking

@SCALE SERIES: Data @Scale

TYPE: video

YEAR: 2022

Building large-scale deep learning systems for production is a problem not only of model training but also of data preprocessing. At production scale, data loading and processing alone can cause significant friction and consume your engineers’ time, and can still perform poorly as data volume grows. We give an overview of the main pain points in this space. With these pain points in mind, we’ve created two libraries that address different parts of the data workflow: TorchData, which makes pipeline creation composable, easy to use, and flexible, simplifying the path from research to production; and TorchArrow, a DataFrame library that scales through high-performance execution runtimes built on the Arrow memory format. We’ll step through the out-of-the-box offerings of our open-sourced TorchData and TorchArrow APIs and building blocks, and present a real-world case study showing how we’ve made data preprocessing performant at scale within Meta. Lastly, we’ll give a peek into upcoming work as we continue to develop these libraries and share our learnings with the open source community.
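To make "composable pipeline creation" concrete, here is a minimal standard-library sketch of the idea: small, lazy stages chained into one preprocessing pipeline. It is an illustration only and does not use the TorchData API; the `Pipe` class, its methods, and the sample records are all hypothetical names invented for this example.

```python
from typing import Any, Callable, Iterable, Iterator, List


class Pipe:
    """Hypothetical composable pipeline stage (NOT the TorchData API)."""

    def __init__(self, make_iter: Callable[[], Iterator[Any]]):
        # Zero-argument callable that returns a fresh iterator,
        # so the pipeline can be iterated more than once.
        self._make_iter = make_iter

    def __iter__(self) -> Iterator[Any]:
        return self._make_iter()

    @classmethod
    def from_iterable(cls, items: Iterable[Any]) -> "Pipe":
        return cls(lambda: iter(items))

    def map(self, fn: Callable[[Any], Any]) -> "Pipe":
        # Lazily apply fn to each record of the upstream stage.
        return Pipe(lambda: (fn(x) for x in self))

    def filter(self, pred: Callable[[Any], bool]) -> "Pipe":
        # Lazily keep only the records for which pred is true.
        return Pipe(lambda: (x for x in self if pred(x)))

    def batch(self, size: int) -> "Pipe":
        # Group records into lists of `size`; the last batch may be smaller.
        def gen() -> Iterator[List[Any]]:
            buf: List[Any] = []
            for x in self:
                buf.append(x)
                if len(buf) == size:
                    yield buf
                    buf = []
            if buf:
                yield buf

        return Pipe(gen)


# Made-up raw records; "x,dog" has a malformed id and is dropped.
rows = ["3,cat", "x,dog", "8,bird", "5,fish", "12,owl"]

pipeline = (
    Pipe.from_iterable(rows)
    .map(lambda line: tuple(line.split(",")))   # parse into (id, label)
    .filter(lambda rec: rec[0].isdigit())       # drop malformed ids
    .map(lambda rec: (int(rec[0]), rec[1]))     # convert id to int
    .batch(2)                                   # group into mini-batches
)

batches = list(pipeline)
```

Each stage wraps the one before it and does no work until the pipeline is iterated, so stages can be added, removed, or reordered without touching the others. That is the property the word "composable" refers to here.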
