The @Scale Conference

San Jose Convention Center 8:30am - 6:30pm

Event Completed

The 2018 @Scale Conference is an invitation-only technical event for engineers who work on large-scale platforms and technologies. Building applications and services that scale to millions or even billions of people presents a complex set of engineering challenges, many of them unprecedented. The @Scale community is focused on bringing people together to openly discuss these challenges and collaborate on the development of new solutions.

This year’s event features technical deep dives from engineers at companies operating at massive scale. Adobe, Amazon, Cloudera, Cockroach Labs, Facebook, Google, Microsoft, NASA, NVIDIA, Pinterest, and Uber are scheduled to appear. The day has three tracks: Data, Dev Tools & Ops, and Machine Learning. Our scale system demonstration area is also back for a third year, so engineers can interact with the latest tech. Keep an eye on this page – we’ll add videos of the presentations soon.

Join the @Scale Community page to receive updates and check out the Videos & Articles page to see videos of the talks from the 2018 @Scale Conference and other @Scale events.


@Scale brings thousands of engineers together throughout the year to discuss complex engineering challenges and to work on the development of new solutions. We're committed to providing a safe and welcoming environment — one that encourages collaboration and sparks innovation.

Every @Scale event participant has the right to enjoy their experience without fear of harassment, discrimination, or condescension. The @Scale code of conduct outlines the behavior that we support and don't support at @Scale events and conferences. We expect participants to follow these rules at all @Scale event venues, online communities, and event-related social activities. These guidelines will keep the @Scale community a safe and enjoyable one for everyone.

Be welcoming. Everyone is welcome at @Scale events, inclusive of (but not limited to) gender, gender identity or expression, sexual orientation, body size, differing abilities, ethnicity, national origin, language, religion, political beliefs, socioeconomic status, age, color and neurodiversity. We have a zero-tolerance policy for discrimination.

Choose your words carefully. Treat one another with respect and in a professional manner. We're here to collaborate. Conflict is not part of the equation.

Know where the line is, and don't cross it. Harassment, threats, or intimidation of any kind will not be tolerated. This includes verbal, physical, sexual (such as sexualized imagery on clothing, presentations, in print, or onscreen), written, or any other form of aggression (whether outright, subtle, or micro). Behavior that is offensive, as determined by @Scale organizers, security staff, or conference management, will not be tolerated. Participants who are asked to stop a behavior or an action are expected to comply immediately or will be asked to leave.

Don't be afraid to call out bad behavior. If you're the target of harmful or offensive behavior, or if you witness someone else being harassed, threatened, or intimidated, don't look away. Tell an @Scale staff member, a security staff member, or a conference organizer immediately. Please notify our event staff, security staff, or conference organizers of any harmful or offensive behavior that you've experienced or witnessed in any form, whether in person or online.

We at @Scale want our events to be safe for everyone, and we have a zero-tolerance policy for violations of our code of conduct. @Scale conference organizers will investigate any allegation of problematic behavior, and we will respond accordingly. We reserve the right to take any follow-up actions we determine are needed. These may include a warning, refusal of admittance, ejection from the conference without a refund, and a ban from future @Scale events.

Agenda
Filter by Track:
  • Keynote
  • Data
  • Machine Learning
  • Dev Tools & Ops
8:30am - 10:00am

Registration and Breakfast

9:00am - 10:00am

Women's Leadership Breakfast

10:00am - 10:30am

Keynote: A Golden Age for Computer Architecture

The end of Dennard scaling and Moore’s law are not problems that must be solved but facts that, if accepted, offer breathtaking opportunities. High-level, domain-specific languages and architectures, aided by open source ecosystems and agilely developed chips, will accelerate progress in machine learning. We envision a new golden age for computer architecture in the next decade, with dramatic gains in cost, energy, and security as well as in performance.
10:30am - 11:00am

Keynote: A Community-Driven Approach to AI Infrastructure

11:00am - 11:30am

Keynote: Inside NVIDIA’s End-to-End AI Infrastructure for Autonomous Driving

In this talk, we’ll discuss our production-level, end-to-end infrastructure and workflows to develop AI for self-driving cars. We’ll explore the platform that supports continuous data ingest from multiple cars (each producing TBs of data per hour) and enables AI designers to iterate on training new neural network designs across thousands of GPU systems and validate their behavior over multi-PB-scale data sets. The obstacles faced by self-driving cars aren’t limited to the world of autonomous driving. We will share how the problems we’ve solved in building this infrastructure for training and inference at scale are applicable to others deploying AI-based services.
11:35am - 12:55pm

Lunch

12:55pm - 1:25pm

Data: Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases

Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this talk, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network. Aurora brings a novel architecture to the relational database to address this constraint, most notably by pushing redo processing to a multitenant scale-out storage service, purpose-built for Aurora. We describe how doing so not only reduces network traffic but also allows for fast crash recovery, failovers to replicas without loss of data, and fault-tolerant, self-healing storage. Traditional implementations that leverage distributed storage would use distributed consensus algorithms for commits, reads, replication, and membership changes, and amplify the cost of underlying storage. We will describe how Aurora avoids distributed consensus under most circumstances by establishing invariants and leveraging local transient state. These techniques improve performance, reduce variability, and lower costs.
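The consensus-avoidance point rests on quorum arithmetic. The sketch below (illustrative Python, not Aurora code) models the replication scheme the Aurora paper describes: six storage replicas across three availability zones, a write quorum of four, and a read quorum of three, chosen so any read quorum overlaps any write quorum:

```python
# Aurora-style quorum parameters (per the published design): V = 6 replicas,
# write quorum Vw = 4, read quorum Vr = 3. This is a sketch, not Aurora code.
REPLICAS, WRITE_QUORUM, READ_QUORUM = 6, 4, 3

def durable(acks: int) -> bool:
    """A redo record is durable once a write quorum of replicas acks it,
    so the write survives the loss of up to two replicas (e.g., an AZ)."""
    return acks >= WRITE_QUORUM

def quorums_intersect() -> bool:
    """Reads see the latest durable write only if Vr + Vw > V,
    i.e., every read quorum overlaps every write quorum."""
    return READ_QUORUM + WRITE_QUORUM > REPLICAS
```

With these parameters a write tolerates two failed replicas and a read tolerates three, without running a consensus round per commit.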
12:55pm - 1:25pm

Machine Learning: Friends Don't Let Friends Deploy Black-Box Models: The Importance of Intelligibility in Machine Learning

In machine learning, often a trade-off must be made between accuracy and intelligibility: The most accurate models usually are not very intelligible (e.g., deep nets), and the most intelligible models usually are less accurate (e.g., linear regression). This trade-off often limits the accuracy of models that can safely be deployed in mission-critical applications such as health care where being able to understand, validate, edit, and ultimately trust a learned model is important. I have been developing a learning method based on generalized additive models (GAMs) that is often as accurate as full complexity models but even more intelligible than linear models. This makes it easy to understand what a model has learned, and also makes it easier to edit the model when it learns inappropriate things because of unanticipated problems with the data. Making it possible for experts to understand a model and repair it is critical because most data has unanticipated landmines. In the talk I present a case study where these high-accuracy GAMs discover surprising patterns in data that would have made deploying a black-box model risky. I also briefly show how we are using these models to detect bias in domains where fairness and transparency are paramount.
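The intelligibility of a GAM comes from its additive form: the prediction is a bias plus a sum of per-feature shape functions, so each feature's contribution can be read off (and edited) directly. The toy sketch below uses hand-written shape functions with hypothetical feature names; in a real GAM the shapes are learned, e.g., via splines or boosted trees:

```python
# Toy additive model: score = bias + sum_i f_i(x_i).
# The shape functions and feature names here are illustrative, not learned.
shape_functions = {
    "age": lambda x: 0.03 * (x - 50),                    # linear shape
    "blood_pressure": lambda x: 0.5 if x > 140 else 0.0,  # step shape
}

def gam_predict(row):
    """Return (score, per-feature contributions) for one example.
    Because the model is additive, the contributions fully explain the
    score, and an expert can inspect or repair any single shape."""
    bias = -1.0
    contributions = {f: fn(row[f]) for f, fn in shape_functions.items()}
    return bias + sum(contributions.values()), contributions
```

If one shape function encodes an artifact of the data (a "landmine"), it can be replaced without retraining the rest of the model, which is the repair capability the talk emphasizes.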
12:55pm - 1:25pm

Dev Tools & Ops: Automated Fault-Finding with Sapienz at Facebook

Sapienz designs system tests that simulate user interactions with mobile apps. It automatically finds apps' crashes, then localizes, tracks, and triages them to developers. This talk will cover how Sapienz is deployed at a large scale at Facebook, including its continuous integration with Facebook's development process, fault signals boosted by Infer's static analysis, and cross-platform testing on both Android and iOS.
1:30pm - 2:00pm

Data: Presto: Fast SQL on Everything

Presto is an open source, high-performance, distributed relational database system targeted at making SQL analytics over big data fast and easy at Facebook. It provides rich SQL language capabilities for data engineers, data scientists, and business analysts to quickly and interactively process terabytes to petabytes of data. Presto is widely used at Facebook for interactive analytics, with thousands of active users. We're using Presto to accelerate a massive batch pipeline workload in our Hive warehouse. Presto is also used to support custom analytics workloads with low-latency and high-throughput requirements. As an open source project, Presto has been adopted externally by many companies, including Comcast, LinkedIn, Netflix, and Walmart. In addition, Presto is being offered as a managed service by vendors such as Amazon, Qubole, and Starburst Data. In this talk, we’ll outline a selection of use cases that Presto supports at Facebook, describe its architecture, and discuss several features that enable it to support these use cases.
1:30pm - 2:00pm

Machine Learning: MLPerf: A Suite of Benchmarks for Machine Learning

The MLPerf effort aims to build a common set of benchmarks that enables the machine learning field to measure system performance for both training and inference from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users.
1:30pm - 2:00pm

Dev Tools & Ops: Testing Strategy with Multi-App Orchestration

Uber's dual app experience introduces an interesting challenge for mobile functional testing. This talk introduces methods of doing mobile E2E testing while orchestrating multiple apps. We will discuss the pros and cons of each method and how we scaled them to run with speed and stability.
2:05pm - 2:35pm

Data: Resource Management at Scale for SQL Analytics

Apache Impala is a highly popular open source SQL interface built for large-scale data warehouses. Impala has been deployed in production at over 800 enterprise customers as part of Cloudera Enterprise, managing warehouses up to 40 PB in size. HDFS, cloud object stores, and scalable columnar storage engines make it cheap and easy to store large volumes of data in one place rather than spread across many silos. This data attracts queries and, soon enough, contention for resources arises between different queries, workloads, and organizations. Without resource management policies and enforcement, critical queries can’t run and users can’t interactively query the data. This talk will discuss the challenges in making resource management work at scale for SQL analytics and how we are tackling them in Apache Impala.
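One common building block for this kind of enforcement is pool-based admission control: each resource pool caps the number of concurrently running queries, and excess queries wait in a queue instead of overloading the cluster. The following is a hypothetical minimal sketch of that idea, not Impala's actual implementation:

```python
from collections import deque

class AdmissionController:
    """Hypothetical sketch of per-pool admission control: at most
    `max_running` queries execute concurrently; the rest queue FIFO."""

    def __init__(self, max_running: int):
        self.max_running = max_running
        self.running = 0
        self.queue = deque()

    def submit(self, query_id: str) -> str:
        """Admit the query if there is capacity, otherwise queue it."""
        if self.running < self.max_running:
            self.running += 1
            return "admitted"
        self.queue.append(query_id)
        return "queued"

    def complete(self):
        """Free a slot; admit and return the next queued query, if any."""
        self.running -= 1
        if self.queue:
            self.running += 1
            return self.queue.popleft()
        return None
```

Real systems layer memory estimates, per-user limits, and timeouts on top of this skeleton, but the queue-behind-a-cap structure is the core of the policy.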
2:05pm - 2:35pm

Machine Learning: Applied Machine Learning at Facebook: An Infrastructure Perspective

Machine learning sits at the core of many essential products and services at Facebook. This talk describes the hardware and software infrastructure that supports machine learning at global scale. Facebook machine learning workloads are extremely diverse: Services require many different types of models in practice. This diversity has implications at all layers in the system stack. In addition, a sizable fraction of all data stored at Facebook flows through machine learning pipelines, presenting significant challenges in delivering data to high-performance distributed training flows. Computational requirements are also intense, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference. Addressing these and other emerging challenges continues to require diverse efforts that span machine learning algorithms, software, and hardware design.
2:05pm - 2:35pm

Dev Tools & Ops: Regression Testing Against Real Traffic on a Big Data Reporting System

Adobe Analytics’ underlying query API handles thousands of complex reporting queries per second across a user base of hundreds of thousands. Our engineering team has significantly increased the quality and frequency of our releases over the past few years by mirroring production API traffic to our test builds and running regression analysis that compares the API responses of our production and test builds. Our novel approach allows us to test millions of API permutations using real traffic, without “mocking” any portion of the system and without affecting the availability of our production systems.
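The core of this style of shadow-traffic testing is a response comparator that diffs the production and test responses for the same mirrored request while ignoring fields that legitimately differ between builds. A minimal sketch (field names are illustrative, not Adobe's API):

```python
def diff_responses(prod_resp: dict, test_resp: dict,
                   ignore=("timestamp", "request_id")):
    """Compare prod vs. test responses to one mirrored request.

    Fields in `ignore` (e.g., timestamps, per-request IDs) are expected
    to differ and are excluded; any remaining mismatch is a regression
    candidate, returned as {field: (prod_value, test_value)}.
    """
    strip = lambda d: {k: v for k, v in d.items() if k not in ignore}
    a, b = strip(prod_resp), strip(test_resp)
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b)
            if a.get(k) != b.get(k)}
```

An empty diff over millions of mirrored requests gives high confidence in the test build; a non-empty diff pinpoints exactly which response fields regressed.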
2:40pm - 3:10pm

Data: Goku - Pinterest’s In-House Time-Series Database

Goku is a highly scalable, cost-effective, high-performance online time-series database service. It stores and serves a massive amount of time-series data without losing granularity. Goku can write tens of millions of data points per second and retrieve millions of data points within tens of milliseconds. It supports a high compression ratio, downsampling, interpolation, and multidimensional aggregation. It can be used in a wide range of monitoring tasks, including production safety and IoT. It can also be used for real-time analytics that make use of time-series data.
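Downsampling, one of the features mentioned above, means collapsing raw points into fixed-width time buckets and keeping one aggregate per bucket. A generic sketch of the idea (not Goku's implementation, which would do this over compressed on-disk data):

```python
def downsample(points, bucket_seconds):
    """Downsample (unix_ts, value) pairs into per-bucket averages.

    Each point is assigned to the bucket starting at
    ts - (ts % bucket_seconds); the bucket's value is the mean of its
    points. Returns sorted (bucket_start, mean_value) pairs.
    """
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_seconds, []).append(value)
    return sorted((start, sum(vs) / len(vs)) for start, vs in buckets.items())
```

Swapping the mean for min/max/last gives the other common downsampling aggregates, and interpolation fills buckets that received no points.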
2:40pm - 3:10pm

Machine Learning: Distributed AI with Ray

Over the past decade, the bulk synchronous processing model has been proven highly effective for processing large amounts of data. However, today we are witnessing the emergence of a new class of applications — AI workloads. These applications exhibit new requirements, such as nested parallelism and highly heterogeneous computations. To support such workloads, we have developed Ray, a distributed system, which provides both task-parallel and actor abstractions. Ray is highly scalable, employing an in-memory storage system and a distributed scheduler. In this talk, I will discuss some of our design decisions and the early experience with using Ray to implement a variety of applications.
2:40pm - 3:10pm

Dev Tools & Ops: One World: Scalable Resource Management

As Facebook's user base and family of applications grow, we need to ensure correctness and performance across many hardware and software platforms. Managing all these combinations for testing would be an operational impossibility for small teams focused on a particular service or feature. To make testing scalable and reliable, we built a resource management system called One World that allows teams to dependably request the platforms they require through a unified API, regardless of platform type or location.
3:10pm - 3:40pm

Office Hours

3:40pm - 4:10pm

Data: Run Your Database Like a CDN

Modern businesses serve customers around the globe, but few manage to avoid sending far-flung customer requests across an ocean (or two!) to application servers colocated with a centralized database. This presents two significant problems: high latencies and conflicts with evolving data sovereignty regulations. CockroachDB is a distributed SQL database that solves the problem of global scale using a combination of features, including geo-replication, geo-partitioning, and data interleaving, which together allow a customer's data to stay in close proximity while still enjoying strong, single-copy consistency. This talk will briefly introduce CockroachDB and then explore how it is able to achieve low latency and precise data domiciling in an example global use case.
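The essence of geo-partitioning is a routing rule: rows carrying a locality attribute are pinned to the partition (and hence the replicas) in the matching region, so both the data's storage location and its serving latency follow the customer. A hypothetical sketch of that rule, with made-up region and country names:

```python
# Hypothetical geo-partitioning routing table; region and country codes
# are illustrative, not CockroachDB configuration.
REGION_OF_COUNTRY = {
    "de": "eu-west",  # German customers' rows stay in an EU region
    "fr": "eu-west",
    "us": "us-east",
}

def partition_for(user_country: str, default: str = "us-east") -> str:
    """Pick the partition (region) whose replicas should hold this
    user's rows, keeping data near the user and inside its jurisdiction."""
    return REGION_OF_COUNTRY.get(user_country, default)
```

In a real deployment this mapping lives in the database's partitioning and replica-placement configuration rather than in application code, which is what lets the database enforce domiciling transparently.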
3:40pm - 4:10pm

Machine Learning: Accelerate Machine Learning at Scale Using Amazon SageMaker

Organizations are using machine learning to address a series of business challenges, such as recommendations, demand forecasting, customer churn, and medical research. The process of machine learning includes framing the problem statement, data collection and preparation, training and tuning, and deploying the models. In this session, we will talk about how Amazon SageMaker removes the barriers and complexity associated with building, training, and deploying machine learning models at scale to address a wide range of use cases.
3:40pm - 4:10pm

Dev Tools & Ops: Machine Learning Testing at Scale

Machine learning is all around us, powering many Google products such as Google Home, Search, and Gmail, as well as systems such as self-driving cars and fraud detection. Across the industry, a tremendous amount of effort goes into improving people’s experiences with products powered by machine learning and AI. However, developing and deploying high-quality, robust ML systems at Google's scale is hard. This can be due to many factors, including but not limited to distributed ownership, training/serving skew, maintaining privacy and proper access controls of data, model freshness, and compatibility. In the face of such challenges, we started an ML productivity effort to empower developers to move quickly and launch with confidence. This effort encompasses building infrastructure for reliability and reusability of software as well as extraction of critical ML metrics that can be monitored to make informed decisions through the ML life cycle. In this talk, we will discuss a few examples where these efforts may be applicable.
4:15pm - 4:45pm

Data: Managing Datastore Locality at Scale with Akkio

Akkio is a locality management service layered between client applications and distributed data store systems. It determines how and when to migrate data to reduce response times and resource usage. Akkio primarily targets multi-data-center geo-distributed datastore systems. Its design was motivated by the observation that many of Facebook’s frequently accessed data sets have low R/W ratios and are not well served by distributed caches or full replication. Akkio’s unit of migration is called a µ-shard. Each µ-shard is designed to contain related data with some degree of access locality. At Facebook, µ-shards have become a first-class abstraction. Akkio went into production at Facebook in 2014, and it currently manages approximately 100 PB of data. Akkio is portable: It currently runs on five data store systems.
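A locality manager of this kind ultimately makes a placement decision per µ-shard: if recent accesses come overwhelmingly from one remote data center, move the µ-shard there rather than caching or replicating it everywhere. The policy below is a hypothetical illustration of that decision, not Akkio's actual algorithm:

```python
from collections import Counter

def choose_placement(access_log, current_dc, migration_threshold=0.8):
    """Decide where a µ-shard should live, given recent accesses.

    access_log: list of data-center names that recently read/wrote the
    µ-shard. If a single remote DC accounts for at least
    `migration_threshold` of accesses, migrate there; low-R/W-ratio data
    is cheap to move once but expensive to replicate everywhere.
    All names and the threshold are illustrative.
    """
    counts = Counter(access_log)
    top_dc, hits = counts.most_common(1)[0]
    if top_dc != current_dc and hits / len(access_log) >= migration_threshold:
        return top_dc
    return current_dc
```

A production system would add hysteresis and rate limits so µ-shards do not ping-pong between data centers when access patterns oscillate.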
4:15pm - 4:45pm

Machine Learning: Computer Vision at Scale as Cloud Services

As computer vision matures and becomes ready for real-world applications, we set out on a mission to scale and democratize it via cloud services. Starting by integrating some of the latest computer vision work from Microsoft Research, we quickly learned that building such a service at scale requires not only state-of-the-art algorithms but also careful attention to customer demands. In this talk, we will walk through some of the challenges we faced, including data privacy, deep customization, and bias correction, and discuss solutions we have built to tackle these challenges.
4:15pm - 4:45pm

Dev Tools & Ops: ISSTAC Scalability Challenges

Attacks relying on the space-time complexity of algorithms implemented by software systems are gaining prominence. Software systems are vulnerable to such attacks if an adversary can inexpensively generate inputs that cause the system to consume an impractically large amount of time or space to process those inputs, thus denying service to benign users or disabling the system. The adversary can also use the same inputs to mount side-channel attacks that aim to infer some secret from the observed space-time system behavior. Our project, ISSTAC: Integrated Symbolic Execution for Space-Time Analysis of Code, has developed automated analysis techniques and has implemented them in an industrial-strength tool that allows the efficient analysis of software (in the form of Java bytecode) with respect to space-time complexity vulnerabilities. The analysis is based on symbolic execution, a well-known analysis technique that systematically explores program execution paths and also generates inputs that trigger those paths. I will give an overview of the project and highlight scalability challenges and how we addressed them in our project.
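A toy example makes the attack class concrete: for an algorithm with a bad worst case, an adversary who can choose inputs can force the expensive path. Insertion sort does O(n) comparisons on sorted input but O(n²) on reverse-sorted input, so a reversed list is exactly the kind of adversarial input a tool like ISSTAC would synthesize (this is an illustration of the attack class, not ISSTAC's analysis, which works on Java bytecode via symbolic execution):

```python
def insertion_sort_comparisons(xs):
    """Run insertion sort and count element comparisons performed."""
    xs, comps = list(xs), 0
    for i in range(1, len(xs)):
        j = i
        while j > 0:
            comps += 1
            if xs[j - 1] > xs[j]:
                xs[j - 1], xs[j] = xs[j], xs[j - 1]  # shift element left
                j -= 1
            else:
                break
    return comps

n = 100
best = insertion_sort_comparisons(range(n))           # already sorted: n - 1
worst = insertion_sort_comparisons(range(n, 0, -1))   # adversarial: n(n-1)/2
```

The 50x gap at n = 100 grows linearly with n, which is why cheap-to-generate inputs can consume an impractically large amount of server time, i.e., a denial of service.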
4:50pm - 5:20pm

Data: Scaled Machine Learning Platform at Uber

Michelangelo is the machine learning platform that we have built at Uber. The purpose of Michelangelo is to enable data scientists and engineers to easily build, deploy, and operate machine learning solutions at scale. It is designed to be ML-as-a-service, covering the end-to-end machine learning workflow: manage data, train models, evaluate models, deploy models, make predictions, and monitor predictions. Michelangelo supports traditional ML models, time series forecasting, and deep learning. In this talk, I will cover some of the key ML use cases at Uber, the main Michelangelo components and workflows, and some of the newer areas that we are developing.
4:50pm - 5:20pm

Machine Learning: Artificial Intelligence at Orbital Insight

Orbital Insight is a geospatial big data company leveraging the rapidly growing availability of satellite, UAV, and other geospatial data sources to understand and characterize socio-economic trends at global, regional, and hyperlocal scales. This talk discusses the satellite imagery domain, how it’s evolving, and the various advantages and challenges of working with such imagery. You will see several example applications demonstrating how machine learning is disrupting this space.
4:50pm - 5:20pm

Dev Tools & Ops: Scaling Concurrency Bug Detection with the Infer Static Analyser

Concurrency is hard and inevitable, given the evolution of computing hardware. Helping programmers avoid the exotic and messy bugs that come with parallelism can be a productivity multiplier but has been elusive. Implementing such a service via static analysis and at the scale of Facebook may sound too good to be true. In this talk, I will discuss our efforts to catch data races, deadlocks, and other concurrency pitfalls by deploying two analyzers based on Facebook Infer that comment at code review time, giving programmers early feedback.
5:30pm - 6:30pm

Networking Happy Hour and Office Hours

Join the @Scale Mailing List and Get the Latest News & Event Info
