
Data @Scale Boston 2018

OCTOBER 25, 2018 @ 8:30 AM PDT - 6:00 PM PDT
Designed for engineers who are interested in building, operating, and using data systems at scale. Data already enables companies to build products with user empathy, find new market opportunities, understand trends, make better decisions, and ensure that their services and systems stay healthy.


Data @Scale is an invitation-only technical conference for engineers focused on the latest developments and challenges associated with building, operating, and using data systems at scale. Building services that serve millions or even billions of people presents a set of complex, and often unprecedented, engineering challenges.

Join experts from companies including Google, HubSpot, InfluxData, OM1 and Wayfair to openly discuss these challenges and collaborate on the development of new solutions. We’ll spend the day covering a wide range of topics including databases, analytics, streaming, distributed systems, data privacy, and machine learning.

Keep an eye on this page – we’ll continue to add to the lineup as we get closer to the show.


Event times below are displayed in PT.

October 25

08:30 AM - 10:00 AM
Registration & Breakfast
10:00 AM - 10:10 AM
10:10 AM - 10:35 AM
Protecting Patient Privacy While Using Real-World Evidence

The growing availability of healthcare data in digital form is changing clinical research and the way that medicine is practiced. When massive amounts of healthcare data are combined with big data architectures, machine learning, and a great deal of ingenuity, it is now possible to compare treatments and providers based upon their outcomes in real patient populations, and it is increasingly possible to predict patient outcomes under different treatment scenarios. But how do we protect the privacy of patients while enabling this? In this talk, I will explore the challenges of working with patient data and present solutions to combining large-scale healthcare datasets while still preserving patient privacy.

Speaker Philip Wickline, OM1
10:35 AM - 11:00 AM
Balancing Flexibility and Control with Database Deployments

Options for database technologies and deployment strategies, whether on-premises or cloud-based, are growing at a significant pace. With this growth often comes the desire to implement organizational constraints around technology choices. In this session we’ll examine the positive and negative impacts of selection constraints, highlighting two contrasting use cases. We’ll also offer recommendations on how to achieve a balance of flexibility and control.

Speaker Tanya Cashorali, TCB Analytics
11:00 AM - 11:25 AM
Lessons and Observations Scaling a Timeseries Database

InfluxData builds a timeseries platform primarily deployed for DevOps and IoT monitoring. This talk presents several lessons learned while scaling the platform across a large number of deployments, from single-server open-source instances to highly available high-throughput clusters. It also presents a number of failure conditions that informed subsequent design choices. I'll discuss designing backpressure in an AP system with tens of thousands of resource-limited writers; trade-offs between monolithic and service-oriented database implementations; and lessons learned implementing multiple query processing systems.

Speaker Ryan Betts, InfluxData
11:25 AM - 11:50 AM
Leveraging Sampling to Reduce Data Warehouse Resource Consumption

The volume of data processed by Facebook’s analytics workload has been rapidly increasing, resulting in greater compute and storage demands. We have been investigating how to use sampling as a technique to offset such demand while still providing good approximate query results. In this talk, we discuss the approaches we have been using to support the computation of analytical dashboards using sampling, where approximations result in negligible visual differences of the graphs. We discuss the challenges that this poses to approximate computation, such as the need to consider uncertainty propagation when calculating aggregated metrics. We will also show the benefits in terms of resource consumption, in both compute and storage.

Speaker Gabriela Jacques Da Silva, Facebook
Speaker Donghui Zhang, Facebook
11:50 AM - 12:15 PM
Voting with Witnesses the Apache Cassandra Way

Transient Replication, an adaptation of Witness Replicas, is a mechanism for achieving up to 50% disk space and compute savings in a DBMS that replicates many small records without a distributed log. With Transient Replication we divide the group of voting replicas into nodes that fully replicate a value and nodes that transiently replicate that value. Under non-failure conditions, this allows nodes to avoid retaining values they transiently replicate. Building on Transient Replication we can also implement an optimization, called Cheap Quorums, that avoids writing to transient replicas entirely, which helps eliminate read and write load at transient replicas. I will also present mitigations that enable Transient Replication to perform well even under failure conditions.

Speaker Ariel Weisberg, Apache Cassandra
12:15 PM - 01:15 PM
01:15 PM - 01:40 PM
Deleting Data @ Scale

Deletion is critical to helping people control their data. It also has unique technical challenges at scale – for example, managing deletion across distributed systems and building in mechanisms to confirm completeness and accuracy. This talk will cover Facebook’s Deletion Framework, a system we built to automatically detect gaps, ensure completeness, and make sure the correct data is deleted.

Speaker Ben Strahs, Facebook
01:40 PM - 02:05 PM
Scaling Data Plumbing at Wayfair

Wayfair is an e-commerce company in the home goods space, whose annual revenue has grown ~10x in 6 years. The corresponding data infrastructure’s growth is superlinear. Ben Clark will describe an early-stage hodge-podge of under-engineered components, traffic patterns that started to look like trouble, and four decisions to invest in writing or modifying data plumbing components in C, C++ and Rust. The latest of these is ‘Tremor,’ a traffic shaper and router that is replacing logstash and regulating the flow of data through our fast-growing network of Kafka clusters and destination data stores.

Speaker Ben Clark, Wayfair
02:05 PM - 02:30 PM
Kubeflow: Portable Machine Learning on Kubernetes

Kubeflow, a framework on Kubernetes that supports the full lifecycle of an ML application, addresses three of the biggest development challenges: scalability, portability, and composability. It provides a single, unified tool for running common processes such as model training, evaluation, and serving, as well as monitoring, logging, and other operational tools. It supports the use of the entire stack locally, on-premises, or on the cloud platform of your choice, including specialized hardware such as GPUs. By reducing variability between services and environments, Kubeflow enables applications that are more robust and resilient, resulting in less downtime, fewer quality issues, and reduced customer impact.

Speaker Michelle Casbon, Google
02:30 PM - 02:55 PM
How DataXu Built a Cloud-Native Warehouse

At DataXu, we manage digital advertising campaigns at scale — collecting and processing large volumes of log data daily, and delivering reports that track campaign performance metrics on an hourly basis. To achieve this, we successfully migrated multiple databases and warehouses to the cloud. Today, our reporting warehouse supports all major business workflows, ranging from daily impression/activity reporting for customers to business metrics for internal teams. DataXu’s “cloud native” warehouse architecture was an early user of Glue Data Catalog, Athena (Presto-as-a-service), Lambda and serverless infrastructure on AWS. In this talk, we will share lessons learned in our multi-year journey to the cloud.

Speaker Suchi Raman, DataXu
02:55 PM - 03:20 PM
Migrating Elasticsearch Instances at Scale

Since the passing of the General Data Protection Regulation (GDPR), many companies have needed to increase the security of their data systems. To migrate teams easily to new, secure clusters, the HubSpot Data Infrastructure team wrote a managed ingestion pipeline using Kafka that enables them to seamlessly migrate teams' indices from one cluster to another with minimal effort on the teams' behalf. Patrick will discuss the ingestion pipeline, the migration process, and how it leverages Elasticsearch features to make the migration seamless for HubSpot's other development teams.

Speaker Patrick Dignan, HubSpot
03:20 PM - 03:50 PM
03:50 PM - 04:15 PM
Presto: Pursuit of Performance

Presto is an open source distributed SQL engine widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Inspired by increasingly complex SQL queries, engineers at Facebook and Starburst have recently focused on cost-based query optimization. In the first part of this talk, we will present the design and implementation of the Presto cost-based optimizer (CBO) to support connector-provided statistics, estimate selectivity, and choose efficient query plans. In the second part of the talk, we will discuss a new mechanism in Presto that computes statistics seamlessly and efficiently, making all Presto-generated data ready for CBO without any extra manual steps. Finally, we will discuss our future work enhancing the CBO and statistics collection in Presto.

Speaker Andrii Rosa, Facebook
Speaker Matt Fuller, Starburst
04:15 PM - 04:35 PM
Palisade: Overload Protection for Data Analytics and Storage Systems

With the growth of edge services, we face the challenge of safely collecting large volumes of data at low latency for data processing and analytics systems. In this talk, we present Palisade, an overload protection system for reporting and analytics data processing infrastructures. Palisade is built on a flexible stream processing system called Akamill that provides buffering, connection pooling, and transformations to support near-real-time data collection for applications. We focus on the continuous collection of traffic summaries and the control system that smooths out large dynamic traffic spikes with minimal overhead.

Speaker Aniruddha Bohra, Akamai
04:35 PM - 05:00 PM
Building Highly Reliable Data Pipelines at Datadog

Every day at Datadog, we run massive data pipelines to power core functionality. In this talk, we’ll cover best practices we use at Datadog to ensure that we reliably deliver this functionality while processing trillions of points/day in the face of exponential data growth, hardware failures, corrupt data, and even human error.

Speaker Jeremy Karn, Datadog
05:00 PM - 05:05 PM
Closing Remarks
05:05 PM - 06:00 PM
Networking Happy Hour


