Making Data Quality an integral part of developing Machine Learning and Data Products
“Machine Learning models are only as good as the data used to train them. Datasets are often plagued by problems such as poor quality, limited discoverability, and undesirable social biases. As data and modeling tools become more accessible, tools to maintain auditability, data lineage, and reproducibility have not caught up. Ignoring these concerns affects data and model quality, and the problems will only compound as the amount of available training data grows. Growing datasets also incur additional costs and hurt productivity because of a lack of tools that promote re-use and sharing of computations.
In this talk, we will introduce two open source products:
Flyte: A platform for orchestrating Machine Learning and Data Workflows. It is built on core tenets of Reproducibility, Efficiency and Auditability.
Pandera: A programmatic statistical typing and data testing tool for scientific and analytics data containers.
Together, these tools can substantially improve a user's workflow and address data quality requirements throughout the ML/Data product development lifecycle.
Flyte was built to be type-safe to promote the re-use of computations across an organization. This was modeled on service-oriented API design, so that teams can offer data transformations as a service. Flyte task definitions use typed inputs and outputs, which lets the platform statically verify and reason about a workflow. This approach, combined with immutable versioning, makes task computations reusable. Furthermore, pre-computed outputs can be reused to save cost and time. When combined with Pandera, Flyte brings data quality guarantees throughout the development process.
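To make the idea of typed, versioned, reusable tasks concrete, here is a minimal standard-library sketch. It is not Flyte's API: the names `Task`, `task`, `check_edge`, and the example tasks are hypothetical, and the in-memory cache stands in for whatever storage a real platform would use. It only illustrates how type annotations allow a connection between two tasks to be checked before anything runs, and how a version plus the inputs can key a cache of pre-computed outputs.

```python
import hashlib
import json
from typing import Any, Callable, Dict, get_type_hints

# Hypothetical in-memory store of pre-computed outputs (illustration only).
_CACHE: Dict[str, Any] = {}


class Task:
    """A function with typed inputs/outputs and a fixed version string."""

    def __init__(self, fn: Callable[..., Any], version: str) -> None:
        self.fn = fn
        self.version = version
        hints = get_type_hints(fn)
        self.output_type = hints.pop("return")  # declared output type
        self.input_types = hints                # declared input types
        self.calls = 0                          # real executions (cache misses)

    def __call__(self, **kwargs: Any) -> Any:
        # Check the supplied inputs against the declared interface.
        if set(kwargs) != set(self.input_types):
            raise TypeError(f"{self.fn.__name__}: unexpected or missing inputs")
        for name, expected in self.input_types.items():
            if not isinstance(kwargs[name], expected):
                raise TypeError(f"{self.fn.__name__}: {name} must be {expected.__name__}")
        # The cache key covers task name, version, and inputs, so a new
        # version never reuses outputs produced by an older one.
        payload = json.dumps([self.fn.__name__, self.version, kwargs], sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        if key not in _CACHE:
            self.calls += 1
            result = self.fn(**kwargs)
            if not isinstance(result, self.output_type):
                raise TypeError(f"{self.fn.__name__}: output must be {self.output_type.__name__}")
            _CACHE[key] = result
        return _CACHE[key]


def task(version: str) -> Callable[[Callable[..., Any]], Task]:
    """Decorator that turns an annotated function into a Task."""
    return lambda fn: Task(fn, version)


def check_edge(upstream: Task, downstream: Task, param: str) -> None:
    """Verify, without running anything, that upstream's output fits downstream's input."""
    if upstream.output_type is not downstream.input_types[param]:
        raise TypeError(
            f"{upstream.fn.__name__} -> {downstream.fn.__name__}.{param}: "
            f"{upstream.output_type.__name__} is not {downstream.input_types[param].__name__}"
        )


@task(version="v1")
def clean(values: list) -> list:
    # Drop missing entries.
    return [v for v in values if v is not None]


@task(version="v1")
def mean(values: list) -> float:
    return sum(values) / len(values)


@task(version="v1")
def label(score: str) -> str:
    return "score=" + score
```

In this sketch, `check_edge(clean, mean, "values")` passes, while `check_edge(mean, label, "score")` raises a `TypeError` before any task executes, because `mean` declares a `float` output and `label` expects a `str`. Calling `mean` twice with the same inputs runs the function body only once; the second call returns the stored output.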
This talk will conclude with a demo and concrete steps attendees can take to use either of these products to deploy quality ML and data products.”