TOPIC: Data, Systems and Networking

Data @Scale Spring 2022

MAY 18, 2022 @ 9:00 AM PDT - 2:00 PM PDT

Designed for engineers who are interested in building, operating, and using data systems at scale. Data already enables companies to build products with user empathy, find new market opportunities, understand trends, make better decisions, and ensure that their services and systems stay healthy.

RSVPS CLOSED

AGENDA SPEAKERS

ABOUT EVENT

Data @Scale is a technical conference for engineers who are interested in building, operating, and using data systems at scale. Data already enables companies to build products with user empathy, find new market opportunities, understand trends, make better decisions, and ensure that their services and systems stay healthy. The landscape of data systems is quickly evolving, and, especially at extreme scale, imposes unique and complex engineering challenges.

This year’s Data @Scale will be focused on the new challenges that Machine Learning presents for data infrastructure.

Speakers from Meta and other industry-leading companies will discuss how they are tackling these challenges today, and we hope the event fosters a community that can discuss and collaborate on the development of practical industry solutions together.

The conference will be hosted virtually on May 18th starting at 9 AM PT, and will feature keynote sessions, tech talks and Q&A sessions.

Event times below are displayed in PT.

EVENT AGENDA

Event times below are displayed in PT.

May 18

09:00 AM - 09:05 AM

Opening Remarks

Speaker Barak Yagour,Meta

09:05 AM - 09:20 AM

Keynote

WATCH NOW

Speaker Aparna Ramani,Meta

09:20 AM - 09:50 AM

Automated Model Update & Evaluation

WATCH NOW

This talk breaks down stage-by-stage requirements and challenges for online prediction and fully automated, on-demand continual learning. We’ll also discuss key design decisions a company might face when building or adopting a machine learning platform for online prediction and continual learning use cases.

Speaker Chip Huyen,Voltron Data

09:50 AM - 10:05 AM

Real-Time Data Processing for ML Feature Engineering

WATCH NOW

In Meta, we had developed multiple real-time data processing infrastructure like Puma, Stylus and Turbine (SIGMOD '16 and ICDE '20). As Meta grows, the needs for real-time data has grown way beyond traditional data analytics & reporting scenarios. Recently, ML data engineering become increasingly a strong driving force. The real-time data is no longer only examined by human occasionally, but powers ML-based systems to always gain the freshest knowledge and make higher quality predictions. We will talk about the architecture of our latest generation, consolidated real-time data processing platform and how we evolve it for ML real-time feature engineering.

Speaker Weiran Liu,Meta

Speaker Ping Chen,Meta

10:05 AM - 10:15 AM

Scalable Data Transportation & Ingestion with MemQ

WATCH NOW

Machine learning is at the heart of Pinterest and is powered by large scale ML training log collection. To solve the cost efficient data ingestion & transportation problem at Pinterest we developed MemQ, a PubSub system that leverages pluggable cloud native storage like S3 using a decoupled packet based storage design. MemQ is able to scale to GB/s traffic with 90% higher cost efficiency than Apache Kafka, enabling Pinterest to ingest all of our ML training data powering offline training, near real-time model quality validation and ad-hoc analysis.

Speaker Ambud Sharma,Pinterest

10:15 AM - 10:30 AM

Break

Grab a coffee and come back at 10:50 AM!

10:30 AM - 10:50 AM

Keynote

WATCH NOW

Speaker Ion Stoica,AnyScale

10:50 AM - 11:10 AM

Industrial-Scale Machine Learning with Amazon SageMaker

WATCH NOW

Coming Soon!

Speaker Andrea Olgiati,AWS

11:10 AM - 11:35 AM

ML Monitoring & Observability @Meta Scale

WATCH NOW

ML generates significant value for Meta’s infrastructure, tools, products, and users. It drives a varied set of insights; from end-user products such as recommendations and feeds on Facebook and Instagram, to infrastructure insights for demand prediction and capacity planning. However, problems such as gradient explosions, data corruption, feature coverage and multi-layer performance degradations impact the ML ecosystem. As features, data and models scale, the nature of these problems gets more complex to assess impact, root cause and mitigate — especially with siloed tools, teams and metadata, fragmented and manual run books — spread across the ML lifecycle. In this talk, we provide an overview of ML Challenges at Meta, our take on ML monitoring and observability infrastructure and tooling to solve for these problems. We cover an overview of our platform, use cases, and product experiences.

Speaker Partha Kanuparthy,Meta

Speaker Animesh Dalakoti,Meta

Speaker Kunal Bhalla,Facebook

11:35 AM - 11:55 AM

Enabling Machine Learning through Real-Time Data Processing using Rockset

WATCH NOW

Data Infrastructure has evolved in the last 15 years from Hadoop's batch system, to streaming systems like Spark and Kafka and now to realtime systems like Rockset and Clickhouse. Automatic decision making based on massive data sets demands a data infrastructure system that is Real-Time. These decisions are made by either hand crafted rules or served by machine learned models that operate on large datasets and return results in milliseconds. We dive into the design and architecture of one such realtime data processing platform named Rockset. Rockset is a Real-Time indexing database that powers fast SQL over semi-structured data such as JSON, Parquet, or XML without requiring any schematization. All data loaded into Rockset are automatically indexed and a fully featured SQL engine powers fast queries over semi-structured data without requiring any database tuning. Rockset uses open source RocksDB as it storage engine, In this talk, we discuss some of the key design aspects of Rockset such as: * Smart Schema: Smart Schemas can take any semistructured dataset with deeply nested objects and arrays and automatically turn it into a SQL table. This becomes especially important to serve Machine Learning Models in production when the models frequently create new columns or change schema of existing columns. We show how this feature reduces the need for data cleaning or data preparation before data can be used to generate insights or serve models in production. * Converged indexing: A novel storage format (unlike Parquet or ORC), that is built for millisecond latency on massive data sets. This format builds multiple indices including an inverted index, a column index, a row indes, a range index, a time index etc with minimal overhead. This allows model serving to operate on large, fast changing datasets because a query automatically picks the best index to use, thereby making it faster than brute-force scan based systems. * The Aggregator Leaf Tailer architecture: A novel systems architecture that implements a three-way disaggregation among storage, query compute and ingest compute. We describe a novel way to embed User Defined Functions (UDF) written in JavaScript as part of any SQL query. We visualize that UDFs be used to implement Machine Learning models like kNN and Faiss to serve models in production. We describe how Rockset uses SIMD instructions in a vectorized engine to improve query performance and draw a similarity to how machine-learning training infrastructure can leverage a similar approach. We explain how Rockset manages the on-disk format of data with automatic splitting of RocksDB based column based clusters for better compressions and faster decoding, a technique that can be used by general purpose machine learning training infrastructure as well.

Speaker Dhruba Borthakur,Rockset

11:55 AM - 12:30 PM

Lunch

Grab a bite and come back at 12:30PM

12:30 PM - 01:00 PM

Fireside Chat with Aparna Ramani & Yann LeCun

WATCH NOW

Speaker Aparna Ramani,Meta

Speaker Yann LeCun,Meta

01:00 PM - 01:20 PM

TorchData and TorchArrow: Data Preprocessing for ML at Production Scale

WATCH NOW

The problem of deep learning and building large scale systems for production is not just one of model training, but data preprocessing as well. At production scale, just the data loading and processing part of the system can cause significant friction and consume your engineers’ time, while still being non-performant as more and more data is used. We provide an overview of the top pain points that are normally faced in this space. With these pain points in mind, we’ve created two libraries that solve different parts of the data workflow, TorchData to make pipeline creation composable, easy to use, and flexible simplifying the path from research to production, and TorchArrow a DataFrame library that allows for scale through the use of high performance execution runtimes built on the Arrow memory format. We’ll step through the out of the box offerings with our open-sourced TorchData and TorchArrow APIs and building blocks, and provide a real world case study that shows how we’ve made data preprocessing performant at scale within Meta. Lastly, we’ll give a peek into upcoming work as we continue to develop and share our learnings with the open source community.

Speaker Wenlei Xie,Meta

Speaker Vitaly Fedyunin,Meta

Speaker Yingxin Kang,Meta

01:20 PM - 01:35 PM

Making Data Quality an integral part of developing Machine Learning and Data products

WATCH NOW

Machine Learning models are only as good as the data that was used to train them. Datasets are often plagued with problems such as quality, discoverability, and undesirable social biases. As data and modeling tools are becoming more accessible, tools to maintain auditability, data lineage, and reproducibility have not caught up. Ignoring these concerns affect data and model quality and will only compound as the amount of available training data grows. Growing datasets incur additional costs and impact productivity due to a lack of tools that promote re-use and sharing of these computations. In this talk we will introduce two open source products – Flyte: A platform for orchestrating Machine Learning and Data Workflows. It is built on core tenets of Reproducibility, Efficiency and Auditability. Pandera: A programmatic statistical typing and data testing tool for scientific and analytics data containers. Together these can drastically improve the workflow of a user and address data quality requirements throughout the ML/Data product development lifecycle. Flyte was built to be type-safe to promote the re-use of computations across an organization. This was modeled similar to a Service oriented API design, so that teams could offer data transformations as a service. Flyte tasks definitions use typed inputs and outputs, which permits the platform to statically verify and reason about a workflow. The approach combined with immutable versioning permits reusable task computation. Furthermore, pre-computed outputs can be leveraged to save costs and time. When combined with Pandera, it brings quality guarantees throughout the development process. This talk will conclude with a demo and concrete steps for attendees on how they could leverage either of these products to deploy quality ML & data products.

Speaker Ketan Umare,Union.AI

Speaker Katrina Rogan,Union.AI

01:35 PM - 01:55 PM

Minimize Risks and Accelerate MLOps with Model Performance Monitoring and Explainability

WATCH NOW

We’re truly living under the rule of Algorithms, our day-to-day activities from news consumption, job search, and mortgage financing are increasingly being decided by algorithms. Most of these algorithms are AI-based and are increasingly black-box to humans. If we continue to let these algorithms operate the way they do today, in the black box and without human oversight, it would result in a dystopian view of the world where unfair decisions are made by unseen algorithms operating in the unknown. Therefore it is critical that we build trust between AI and humans. In this talk, we will learn about how we can do this by continuously monitoring AI for performance and bias issues and sharing these insights across teams to build a culture of trust in the organization. Fiddler works with Fortune 500 companies to enable responsible AI and regulatory compliance of AI algorithms.

Speaker Krishna Gade,Fiddler

01:55 PM - 02:00 PM

Closing Remarks

WATCH NOW

Speaker Barak Yagour,Meta

SPEAKERS AND MODERATORS

Barak Yagour is a Director of Engineering at Meta, responsible for Data Infrastructure. Over... read more

Barak Yagour

Meta

Aparna Ramani is VP of Engineering at Meta, responsible for Data, Developer and AI... read more

Aparna Ramani

Meta

Chip Huyen works to accelerate data analytics on GPUs at Voltron Data. She also... read more

Chip Huyen

Voltron Data

Weiran leads the Stream Processing team at Meta powering real-time data applications in a... read more

Weiran Liu

Meta

Ping works on Ads machine learning in Meta, and leads the effort of building... read more

Ping Chen

Meta

Ambud Sharma is a tech lead at Pinterest. He has worked on architecting, stabilizing,... read more

Ambud Sharma

Ion Stoica is a Professor in the EECS Department at the University of California... read more

Ion Stoica

AnyScale

Andrea is a Senior Principal Engineer at Amazon Web Services. He is the Chief... read more

Andrea Olgiati

AWS

Partha Kanuparthy is a Software Engineer in the Monitoring area at Meta. His work... read more

Partha Kanuparthy

Meta

Animesh Dalakoti is a Product Manager in the ML Monitoring and Observability space at... read more

Animesh Dalakoti

Meta

Kunal Bhalla is a software engineer working on improving the developer experience for ML... read more

Kunal Bhalla

Facebook

Dhruba Borthakur is CTO and co-founder of Rockset, the Real-Time Analytics company where he... read more

Dhruba Borthakur

Rockset

Yann LeCun is VP & Chief AI Scientist at Meta and Silver Professor at... read more

Yann LeCun

Meta

Wenlei is a Research Scientist working on PyTorch in Meta. He is excited to... read more

Wenlei Xie

Meta

Vitaly Fedyunin is a Software Engineer at Meta, where he works on PyTorch Data... read more

Vitaly Fedyunin

Meta

Yingxin Kang is a software engineer from the ML Data Platform team at Meta,... read more

Yingxin Kang

Meta

Ketan Umare is the TSC Chair for Flyte (incubating under LF AI & Data).... read more

Ketan Umare

Union.AI

Katrina Rogan is software engineer at Union.ai who works on the open source Flyte... read more

Katrina Rogan

Union.AI

Krishna Gade is the Founder/CEO of Fiddler.AI, a Model Performance Monitoring startup. Prior to... read more

Krishna Gade

Fiddler

UPCOMING EVENT JULY 31, 2024 @ 2:30 PM PDT - 7:00 PM PDT - IN PERSON EVENT | AUGUST 7, 2024 @ 2:30 PM PDT - 5:30 PM PDT - VIRTUAL PROGRAM AI @Scale

AI Infra @Scale 2024

Meta’s Engineering and Infrastructure teams are excited to return for the second year in a row to host AI Infra @Scale on July 31. This year’s event is open to a limited number of in-person...

UPCOMING EVENT August 14, 2024 Product @Scale

Product @Scale 2024

Product @Scale conferences are designed for technologists who work on solving complex product problems at scale. The @Scale community focuses on bringing forward people's experiences in creating innovative solutions to large-scale products serving millions or...

UPCOMING EVENT September 11, 2024 | Santa Clara Convention Center Networking @Scale

Networking @Scale 2024

Meta’s Networking team invites you to Networking@scale on September 11th. . This year’s event is an in-person event hosted at the Santa Clara Convention center and will also be live streamed for virtual attendees. Registration...

UPCOMING EVENT October 9, 2024 Reliability @Scale

Reliability @Scale 2024

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems and products at massive scale. Whether it’s novel design decisions, or outages that impact billions...

UPCOMING EVENT October 23, 2024 Mobile @Scale

Mobile @Scale 2024

Mobile @Scale is a technical conference designed for the engineers, product managers, and engineering leaders building mobile experiences at significant scale (millions to billions of daily users). Mobile @Scale provides a rare opportunity to gather...

UPCOMING EVENT November 20, 2024 Video @Scale

Video @Scale 2024

Video @Scale 2024 is a technical conference designed for engineers that develop or manage large-scale video systems serving millions of people. The development of large-scale video systems includes complex, unprecedented engineering challenges. The @Scale community...

PAST EVENT March 20, 2024 @ 9am PT - 3pm PT RTC @Scale

RTC @Scale 2024

RTC @Scale is for engineers who develop and manage large-scale real-time communication (RTC) systems serving millions of people. The operations of large-scale RTC systems have always involved complex engineering challenges which continue to attract attention...

Past EVENT May 22, 2024 Data @Scale

Data @Scale 2024

Data @Scale is a technical conference for engineers who are interested in building, operating, and using data systems at scale. Companies across the industry use data and underlying infrastructure to build products with user empathy,...

Past EVENT June 12, 2024 Systems @Scale

Systems @Scale 2024

Systems @Scale 2024 is a technical conference intended for engineers that build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduces complex, unprecedented engineering...

FIND @SCALE TOPICS

Dev Tools and Ops, Privacy, Sustainability and Performance Fighting Abuse and Security Machine Learning and AI Mobile, Video and Web

Data @Scale Spring 2022

ABOUT EVENT

EVENT AGENDA

May 18

May 18

SPEAKERS AND MODERATORS

Barak Yagour

Aparna Ramani

Chip Huyen

Weiran Liu

Ping Chen

Ambud Sharma

Ion Stoica

Andrea Olgiati

Partha Kanuparthy

Animesh Dalakoti

Kunal Bhalla

Dhruba Borthakur

Yann LeCun

Wenlei Xie

Vitaly Fedyunin

Yingxin Kang

Ketan Umare

Katrina Rogan

Krishna Gade

AI Infra @Scale 2024

Product @Scale 2024

Networking @Scale 2024

Reliability @Scale 2024

Mobile @Scale 2024

Video @Scale 2024

RTC @Scale 2024

Data @Scale 2024

Systems @Scale 2024

FIND @SCALE TOPICS

EXPLORE OTHER SERIES

Networking @Scale

Reliability @Scale

Systems @Scale