TOPIC: Annual @Scale Conference

The @Scale Conference 2018

SEPTEMBER 13, 2018 @ 8:30 AM PDT - 6:30 PM PDT
The annual @Scale conference is an invite-only, in-person conference hosted in the fall of each year in the Bay Area, California. The event invites engineers and industry leaders from all over the world to come together and discuss leading @Scale topics and network with their peers. Attendees will be able to participate in several thought provoking general sessions and 4 separate tracks highlighting the year's themes.
RSVPS CLOSED
AGENDA SPEAKERS

ABOUT EVENT

This year’s event features technical deep dives from engineers at a multitude of scale companies. Adobe, Amazon, Cloudera, Cockroach Labs, Facebook, Google, Microsoft, NASA, NVIDIA, Pinterest, and Uber are scheduled to appear. The day has three tracks: Data, Dev Tools & Ops, and Machine Learning. We also have our scale system demonstration area back for a third year so engineers can interact with the latest tech. Keep an eye on this page – we’ll add videos of the presentations soon.

EVENT AGENDA

Event times below are displayed in PT.

September 13

08:30 AM - 10:00 AM
Registration and Breakfast
09:00 AM - 10:00 AM
Women's Leadership Breakfast
Speaker Surabhi Gupta,Airbnb
Speaker Shan He,Uber
Speaker Rachel Peterson,Meta
10:00 AM - 10:30 AM
Keynote: Golden Age for Computer Architecture

The end of Dennard scaling and Moore’s law are not problems that must be solved but facts that, if accepted, offer breathtaking opportunities. High-level, domain-specific languages and architectures aided by open source ecosystems and agilely developed chips will accelerate progress in machine learning. We envision a new golden age for computer architecture in the next decade, with dramatic gains in cost and energy security as well as in performance.

Speaker David Patterson,Google and UC Berkeley
10:30 AM - 11:00 AM
Keynote: A Community-Driven Approach to AI Infrastructure
Speaker Jason Taylor,Facebook
11:00 AM - 11:30 AM
Keynote: Inside NVIDIA’s End-to-End AI Infrastructure for Autonomous Driving

In this talk, we’ll discuss our production-level, end-to-end infrastructure and workflows to develop AI for self-driving cars. We’ll explore the platform that supports continuous data ingest from multiple cars (each producing TBs of data per hour) and enables autonomous AI designers to iterate training new neural network designs across thousands of GPU systems and validate their behavior over multi PB-scale data sets. The obstacles faced by self-driving cars aren’t limited to the world of autonomous driving. We will share how the problems we’ve solved in building this infrastructure for training and inference at scale are applicable to others deploying AI-based services.

Speaker Clément Farabet,Nvidia
11:35 AM - 12:55 PM
Lunch
12:55 PM - 01:25 PM
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this talk, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network. Aurora brings a novel architecture to the relational database to address this constraint, most notably by pushing redo processing to a multitenant scale-out storage service, purpose-built for Aurora. We describe how doing so not only reduces network traffic but also allows for fast crash recovery, failovers to replicas without loss of data, and fault-tolerant, self-healing storage. Traditional implementations that leverage distributed storage would use distributed consensus algorithms for commits, reads, replication, and membership changes, and amplify the cost of underlying storage. We will describe how Aurora avoids distributed consensus under most circumstances by establishing invariants and leveraging local transient state. These techniques improve performance, reduce variability, and lower costs.

Speaker Sailesh Krishnamurthy,Amazon Web Services
12:55 PM - 01:25 PM
Machine Learning: Friends Don't Let Friends Deploy Black-Box Models: The Importance of Intelligibility in Machine Learning

In machine learning, often a trade-off must be made between accuracy and intelligibility: The most accurate models usually are not very intelligible (e.g., deep nets), and the most intelligible models usually are less accurate (e.g., linear regression). This trade-off often limits the accuracy of models that can safely be deployed in mission-critical applications such as health care where being able to understand, validate, edit, and ultimately trust a learned model is important. I have been developing a learning method based on generalized additive models (GAMs) that is often as accurate as full complexity models but even more intelligible than linear models. This makes it easy to understand what a model has learned, and also makes it easier to edit the model when it learns inappropriate things because of unanticipated problems with the data. Making it possible for experts to understand a model and repair it is critical because most data has unanticipated landmines. In the talk I present a case study where these high-accuracy GAMs discover surprising patterns in data that would have made deploying a black-box model risky. I also briefly show how we are using these models to detect bias in domains where fairness and transparency are paramount.

Speaker Rich Caruana,Microsoft Research
12:55 PM - 01:25 PM
Dev Tools & Ops: Automated Fault-Finding with Sapienz at Facebook

Sapienz designs system tests that simulate user interactions with mobile apps. It automatically finds apps' crashes, then localizes, tracks, and triages them to developers. This talk will cover how Sapienz is deployed at a large scale at Facebook, including its continuous integration with Facebook's development process, fault signals boosted by Infer's static analysis, and cross-platform testing on both Android and iOS.

Speaker Ke Mao,Facebook
01:30 PM - 02:00 PM
Presto: Fast SQL on Everything

Presto is an open source, high-performing, distributed relational database system targeted at making SQL analytics over big data fast and easy at Facebook. It provides rich SQL language capabilities for data engineers, data scientists, and business analysts to quickly and interactively process terabytes to petabytes of data. Presto is widely used within at Facebook for interactive analytics with over thousands of active users. We're using Presto to accelerate a massive batch pipeline workload in our Hive Warehouse. Presto is also used to support custom analytics workloads with low-latency and high throughput requirements. As an open source project, Presto has been adopted externally by many companies, including Comcast, LinkedIn, Netflix, and Walmart. In addition, Presto is being offered as a managed service by vendors such as Amazon, Qubole, and Starburst Data. In this talk, we’ll outline a selection of use cases that Presto supports at Facebook, describe its architecture, and discuss several features that enable it to support these use cases.

Speaker Vaughn Washington,Facebook
01:30 PM - 02:00 PM
MLPerf: A Suite of Benchmarks for Machine Learning

The MLPerf effort aims to build a common set of benchmarks that enables the machine learning field to measure system performance for both training and inference from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users.

Speaker Cliff Young,Google
01:30 PM - 02:00 PM
Testing Strategy with Multi-App Orchestration

Uber's dual app experience introduces an interesting challenge for mobile functional testing. This talk introduces methods of doing mobile E2E testing while orchestrating multiple apps. We will discuss the pros and cons of each method and how we scaled them to run with speed and stability.

Speaker Leslie Lei,Uber
02:05 PM - 02:35 PM
Resource Management at Scale for SQL Analytics

Apache Impala is a highly popular open source SQL interface built for for large-scale data warehouses. Impala has been deployed in production at over 800 enterprise customers as part of Cloudera Enterprise, managing warehouses up to 40 PB in size. HDFS, cloud object stores, and scalable columnar storage engines make it cheap and easy to store large volumes of data in one place rather than spread across many siloes. This data attracts queries and, soon enough, contention for resources arises between different queries, workloads, and organizations. Without resource management policies and enforcement, critical queries can’t run and users can’t interactively query the data. This talk will discuss the challenges in making resource management work at scale for SQL analytics and how we are tackling them in Apache Impala.

Speaker Tim Armstrong,Cloudera
02:05 PM - 02:35 PM
Applied Machine Learning at Facebook: An Infrastructure Perspective

Machine learning sits at the core of many essential products and services at Facebook. This talk describes the hardware and software infrastructure that supports machine learning at global scale. Facebook machine learning workloads are extremely diverse: Services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.

Speaker Kim Hazelwood,Facebook
02:05 PM - 02:35 PM
Regression Testing Against Real Traffic on a Big Data Reporting System

Adobe Analytics’ underlying query API handles thousands of complex reporting queries per second across a user base of hundreds of thousands. Our engineering team has significantly increased the quality and frequency of our releases over the past few years by mirroring production API traffic to our test builds and running regression analysis that compares the API responses of our production and test builds. Our novel approach allows us to test millions of API permutations using real traffic, without “mocking” any portion of the system and without also affecting the availability of our production systems.

Speaker Trenton Davies,Adobe Systems Inc.
02:40 PM - 03:10 PM
Goku - Pinterest’s In-House Time-Series Database

Goku is a highly scalable, cost-effective and high-performant online, time-series database service. It stores and serves a massive amount of time-series data without losing granularity. Goku can write tens of millions of data points per second and retrieve millions of data points within tens of milliseconds. It supports high compression ratio, downsampling, interpolation, and multidimensional aggregation. It can be used in a wide range of monitoring tasks, including production safety and IoT. It can also be used for real-time analytics that make use of time-series data.

Speaker Tian-Ying Chang,Pinterest
Speaker Jinghan Xu,Pinterest
02:40 PM - 03:10 PM
Distributed AI with Ray

Over the past decade, the bulk synchronous processing model has been proven highly effective for processing large amounts of data. However, today we are witnessing the emergence of a new class of applications — AI workloads. These applications exhibit new requirements, such as nested parallelism and highly heterogeneous computations. To support such workloads, we have developed Ray, a distributed system, which provides both task-parallel and actor abstractions. Ray is highly scalable, employing an in-memory storage system and a distributed scheduler. In this talk, I will discuss some of our design decisions and the early experience with using Ray to implement a variety of applications.

Speaker Ion Stoica,AnyScale
02:40 PM - 03:10 PM
One World: Scalable Resource Management

As Facebook's user base and family of applications grows, we need to ensure correctness and performance across many hardware and software platforms. Managing all these combinations for testing would be an operational impossibility for small teams focused on a particular service or feature. To make testing scalable and reliable, we built a resource management system called One World to allow teams to dependably request the platforms required via a unified API, no matter its type or location.

Speaker Evan Snyder,Facebook
03:10 PM - 03:40 PM
Office Hours
03:40 PM - 04:10 PM
Run Your Database Like a CDN

Modern businesses serve customers around the globe, but few manage to avoid sending far-flung customer requests across an ocean (or two!) to application servers colocated with a centralized database. This presents two significant problems: high latencies and conflicts with evolving data sovereignty regulations. CockroachDB is a distributed SQL database that solves the problem of global scale using a combination of features, including geo-replication, geo-partitioning, and data interleaving, which together allow a customer's data to stay in close-proximity while still enjoying strong, single-copy consistency. This talk will briefly introduce CockroachDB and then explore how it is able to achieve low latency and precise data domiciling in an example global use case.

Speaker Peter Mattis,Cockroach Labs
03:40 PM - 04:10 PM
Accelerate Machine Learning at Scale Using Amazon SageMaker

Organizations are using machine learning to address a series of business challenges, such as recommendations, demand forecasting, customer churn, and medical research. The process of machine learning includes framing the problem statement, data collection and preparation, training and tuning, and deploying the models. In this session, we will talk about how Amazon SageMaker removes the barriers and complexity associated with building, training, and deploying machine learning models at scale to address a wide range of use cases.

Speaker Vladimir Zhukov,AWS
03:40 PM - 04:10 PM
Machine Learning Testing at Scale

Machine Learning is infused all around us, including a lot of Google products such as Google Home, Search, Gmail, and more, as well as in systems such as those used by self-driving cars and fraud detection systems. A tremendous amount of effort is being made to improve people’s experiences using products throughout the industry, where products are powered by machine learning and AI. However, developing and deploying high-quality, robust ML systems at Google's scale is hard. This can be due to many factors, including but not limited to distributed ownership, training serving skew, maintaining privacy and proper access controls of data, model freshness, and compatibility. In the face of such challenges, we started an ML productivity effort to empower developers to move quickly and launch with confidence. This effort encompasses building infrastructure for reliability and reusability of software as well as extraction of critical ML metrics that can be monitored to make informed decisions through the ML life cycle. In this talk, we will discuss a few examples where these efforts may be applicable.

Speaker Manasi Joshi,Google Inc.
04:15 PM - 04:45 PM
Managing Datastore Locality at Scale with Akkio

Akkio is a locality management service layered between client applications and distributed data store systems. It determines how and when to migrate data to reduce response times and resource usage. Akkio primarily targets multi-data-center geo-distributed datastore systems. Its design was motivated by the observation that many of Facebook’s frequently accessed data sets have low R/W ratios and not well served by distributed caches or full replication. Akkio’s unit of migration is called a µ-shard. Each µ-shard is designed to contain related data with some degree of access locality. At Facebook, µ-shards have become a first-class abstraction. Akkio went into production at Facebook in 2014, and it currently manages approximately 100 PB of data. Akkio is portable: It currently runs on five data store systems.

Speaker Victoria Dudin,Facebook
04:15 PM - 04:45 PM
Computer Vision at Scale as Cloud Services

As computer vision gets mature and ready for real-world applications, we set on a mission to scale and democratize it via cloud services. Starting by integrating some of the latest computer vision work from Microsoft Research, we quickly learned that building such a service at scale requires not only state-of-the-art algorithms but also deep care of customer demands. In this talk, we will walk through some of the challenges we faced, including data privacy, deep customization, and bias correction, and discuss solutions we have built to tackle these challenges.

Speaker Cha Zhang,Microsoft Corporation
04:15 PM - 04:45 PM
ISSTAC Scalability Challenges

Attacks relying on the space-time complexity of algorithms implemented by software systems are gaining prominence. Software systems are vulnerable to such attacks if an adversary can inexpensively generate inputs that cause the system to consume an impractically large amount of time or space to process those inputs, thus denying service to benign users or disabling the system. The adversary can also use the same inputs to mount side-channel attacks that aim to infer some secret from the observed space-time system behavior. Our project, ISSTAC: Integrated Symbolic Execution for Space-Time Analysis of Code, has developed automated analysis techniques and has implemented them in an industrial-strength tool that allows the efficient analysis of software (in the form of Java bytecode) with respect to space-time complexity vulnerabilities. The analysis is based on symbolic execution, a well-known analysis technique that systematically explores program execution paths and also generates inputs that trigger those paths. I will give an overview of the project and highlight scalability challenges and how we addressed them in our project.

Speaker Corina Pasareanu,NASA
04:50 PM - 05:20 PM
Scaled Machine Learning Platform at Uber

Michelangelo is the machine learning platform that we have built at Uber. The purpose of Michelangelo is to enable data scientists and engineers to easily build, deploy, and operate machine learning solutions at scale. It is designed to be ML-as-a-service, covering the end-to-end machine learning workflow: manage data, train models, evaluate models, deploy models, make predictions, and monitor predictions. Michelangelo supports traditional ML models, time series forecasting, and deep learning. In this talk, I will cover some of the key ML use cases at Uber, the main Michelangelo components and workflows, and some of the newer areas that we are developing.

Speaker Jeremy Hermann,Uber
04:50 PM - 05:20 PM
Artificial Intelligence at Orbital Insight

Orbital Insight is a geospatial big data company leveraging the rapidly growing availability of satellite, UAV, and other geospatial data sources to understand and characterize socio-economic trends at global, regional, and hyperlocal scales. This talk discusses the satellite imagery domain, how it’s evolving, and the various advantages and challenges of working with such imagery. You will see several example applications demonstrating how machine learning is disrupting this space.

Speaker Adam Kraft,Orbital Insight
04:50 PM - 05:20 PM
Scaling Concurrency Bug Detection with the Infer Static Analyser

Concurrency is hard and inevitable, given the evolution of computing hardware. Helping programmers avoid the exotic and messy bugs that come with parallelism can be a productivity multiplier but has been elusive. Implementing such a service via static analysis and at the scale of Facebook may sound too good to be true. In this talk, I will discuss our efforts to catch data races, deadlocks, and other concurrency pitfalls by deploying two analyzers based on Facebook Infer that comment at code review time, giving programmers early feedback.

Speaker Nikos Gorogiannis,Facebook London
05:30 PM - 06:30 PM
Networking Happy Hour and Office Hours

SPEAKERS AND MODERATORS

Surabhi leads Engineering for Airbnb's Homes business. Her teams are responsible for guest and... read more

Surabhi Gupta

Airbnb

Shan is a senior data visualization engineer at Uber. She is a coder, a... read more

Shan He

Uber

As Vice President for Data Center Strategy, Rachel Peterson oversees Meta’s global infrastructure expansion... read more

Rachel Peterson

Meta

David Patterson is a computer science professor emeritus at UC Berkeley, a distinguished engineer... read more

David Patterson

Google and UC Berkeley

Jason Taylor

Facebook

Clement Farabet is VP of AI Infrastructure at NVIDIA. His team is responsible for... read more

Clément Farabet

Nvidia

Sailesh Krishnamurthy is a General Manager at Amazon Web Services where he runs engineering,... read more

Sailesh Krishnamurthy

Amazon Web Services

Rich Caruana is a Principal Researcher at Microsoft. Before joining Microsoft, Rich was on... read more

Rich Caruana

Microsoft Research

Ke is a software engineer at Facebook. Prior to Facebook, he was the CTO... read more

Ke Mao

Facebook

Vaughn is a 15+ year veteran of the tech industry and a graduate of... read more

Vaughn Washington

Facebook

Cliff Young is a member of the Google Brain team, whose mission is to... read more

Cliff Young

Google

Leslie Lei is a software engineer at Uber, leading the Mobile Test Infrastructure team,... read more

Leslie Lei

Uber

Tim likes to build high-performance systems for scale-out computation. Tim currently works on Apache... read more

Tim Armstrong

Cloudera

Kim Hazelwood leads the AI Infrastructure Foundation efforts at Facebook, which focus on the... read more

Kim Hazelwood

Facebook

Trent Davies is a Principal Scientist at Adobe Systems Inc., where he has worked... read more

Trenton Davies

Adobe Systems Inc.

Dr. Tian-Ying Chang is a Sr. staff engineer and manager of Storage and Caching... read more

Tian-Ying Chang

Pinterest

Jinghan has been tech leading Pinterest user growth for the past few years and... read more

Jinghan Xu

Pinterest

Ion Stoica is a Professor in the EECS Department at the University of California... read more

Ion Stoica

AnyScale

Evan is a Production Engineer for Developer Infrastructure at Facebook. She relocated to London... read more

Evan Snyder

Facebook

Peter has been in the software industry for just over two decades. During his... read more

Peter Mattis

Cockroach Labs

Vladimir Zhukov

AWS

Manasi Joshi is a director of software engineering at Google, where she is currently... read more

Manasi Joshi

Google Inc.

Victoria Dudin is an Engineering Manager in Core Data organization at Facebook. The projects... read more

Victoria Dudin

Facebook

Cha Zhang

Microsoft Corporation

Corina Pasareanu is an Associate Research Professor at Carnegie Mellon University CyLab, working at... read more

Corina Pasareanu

NASA

Jeremy Hermann leads the Machine Learning Platform team at Uber. Before Uber, he led... read more

Jeremy Hermann

Uber

Adam Kraft comes from Orbital Insight, where he works on large scale computer vision... read more

Adam Kraft

Orbital Insight

Nikos is currently working with the Infer team at Facebook London, helping develop the... read more

Nikos Gorogiannis

Facebook London
UPCOMING EVENT   May 22, 2024 Data @Scale

Data @Scale 2024

Data @Scale is a technical conference for engineers who are interested in building, operating, and using data systems at scale. Companies across the industry use data and underlying infrastructure to build products with user empathy,...
UPCOMING EVENT   June 12, 2024 Systems @Scale

Systems @Scale 2024

Systems @Scale 2024 is a technical conference intended for engineers that build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduces complex, unprecedented engineering...
UPCOMING EVENT   07/31/2024 AI @Scale

AI Infra @Scale 2024

Meta's Engineering and Infrastructure teams are excited to host AI Infra @Scale, a one-day virtual event featuring a range of speakers from Meta who will unveil the latest AI infrastructure investments and innovations powering Meta's...
UPCOMING EVENT   August 7, 2024 Product @Scale

Product @Scale 2024

Product @Scale conferences are designed for technologists who work on solving complex product problems at scale. This year focuses on discussions that explore the creator ecosystem, and how AI will play a role in scaling...
UPCOMING EVENT   September 4-5, 2024 (2 day event) Networking @Scale

Networking @Scale 2024

Networking @Scale is a technical conference for engineers that build and manage large-scale networks. Meta’s Networking Infrastructure team is excited to host Networking @Scale, a two-day virtual event featuring a range of speakers from Meta...
UPCOMING EVENT   September 25, 2024 Reliability @Scale

Reliability @Scale 2024

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems and products at massive scale. Whether it’s novel design decisions, or outages that impact billions...
UPCOMING EVENT   October 23, 2024 Mobile @Scale

Mobile @Scale 2024

Mobile @Scale is a technical conference designed for the engineers, product managers, and engineering leaders building mobile experiences at significant scale (millions to billions of daily users). Mobile @Scale provides a rare opportunity to gather...
UPCOMING EVENT   November 20, 2024 Video @Scale

Video @Scale 2024

Video @Scale 2024 is a technical conference designed for engineers that develop or manage large-scale video systems serving millions of people. The development of large-scale video systems includes complex, unprecedented engineering challenges. The @Scale community...
PAST EVENT   March 20, 2024 @ 9am PT - 3pm PT RTC @Scale

RTC @Scale 2024

RTC @Scale is for engineers who develop and manage large-scale real-time communication (RTC) systems serving millions of people. The operations of large-scale RTC systems have always involved complex engineering challenges which continue to attract attention...

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy