TOPIC: Data, Systems and Networking

Systems @Scale Fall 2019

SEPTEMBER 18, 2019 @ 08:30 AM - SEPTEMBER 18, 2019 @ 06:00 PM PT
Designed for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.

ABOUT EVENT

Systems @Scale is an invitation-only technical conference for engineers who manage large-scale information systems serving millions of people.

As our systems continue to scale, the problem of understanding if they’re behaving as desired gets progressively harder. As a community, we have developed tools, techniques, and approaches that can be applied to observing the state of these complex distributed systems with the goal of understanding system availability, reliability, performance, and efficiency.

We’ll spend the day covering a wide range of topics exploring these challenges and collaborating on the development of new solutions.

EVENT AGENDA

Event times below are displayed in PT.

September 18

08:30 AM - 10:00 AM
Registration & Breakfast
10:00 AM - 10:05 AM
Welcome to Systems @Scale
SPEAKER Jeromy Carriere, Facebook
10:05 AM - 10:30 AM
Keynote: How “Deep Systems” Broke Observability...And What We Can Do About It
SPEAKER Ben Sigelman, LightStep
10:30 AM - 11:00 AM
A Tale of Two Performance Analysis Tools

At Facebook, we care deeply about performance for reasons such as improving user experience, reducing environmental impact, and bringing down operational costs. Quickly and precisely diagnosing the root causes of regressions, as well as identifying optimization opportunities, are key challenges for engineers trying to achieve their performance goals. Sophisticated analysis and visualization tools allow us to gain insights and draw conclusions from collected performance and observability data. In this talk, we share experiences investigating performance regressions using two of our in-house performance analysis tools, CV and Tracery.

SPEAKER Helga Gudmundsdottir, Facebook
11:00 AM - 11:30 AM
Comprehending Incomprehensible Architecture

Our industry has embraced microservices as an architectural pattern, resulting in an exponential increase in the complexity of the distributed systems we operate. For example, Uber’s backend consists of thousands of microservices, and their interdependencies are constantly changing. Distributed tracing has emerged as the go-to solution for understanding what’s going on in these ever-changing architectures. Basic traces are a good start, but even they struggle to deal with the complexity of modern systems, where a single “book a ride” request can generate several thousand trace events. Too much information! This talk starts with a refresher on distributed tracing as a core observability tool in modern systems, from single trace view to aggregate analysis. Then we show how Uber uses data mining, complexity reduction, and intuitive visualizations to bring real traces (not toy examples) back into the realm of human comprehension, guide the user towards actionable insights about the root causes of outages, and drastically reduce time to mitigation.

SPEAKER Yuri Shkuro, Uber
11:30 AM - 12:00 PM
A Picture Is Worth 1,000 Traces

A single trace can reveal many things: network latencies, time spent in databases, a service spinning idly, and so on. But finding the trace that demonstrates a problem in a large distributed application is very hard. By looking at traces in aggregate, we can eliminate the need to state and validate hypotheses; instead, answers start to emerge naturally. This talk will present the power of aggregate analysis of distributed traces by highlighting its applications beyond performance troubleshooting.

SPEAKER Spiros Xanthos, Omnition
SPEAKER Constance Caramanolis, Omnition
12:00 PM - 01:00 PM
Lunch
01:00 PM - 01:30 PM
Service Efficiency at Instagram Scale

Instagram is growing quickly - every day more developers add more features for more users. How does Instagram investigate efficiency bottlenecks? How are performance regressions in production caught, triaged, and resolved? How can we do this all efficiently in production? In this talk, we give an overview of the profiling framework used to understand the production performance of Instagram’s webserver, how the data is processed for regression detection and general efficiency work, and walk through previous iterations of this system to understand changes and improvements.

SPEAKER Dave Marchevsky, Facebook
SPEAKER Pranav Thulasiram Bhat, Instagram
01:30 PM - 02:00 PM
Monarch, Google’s Planet-Scale Monitoring Infrastructure

This session discusses Google's planet-wide monitoring system, Monarch, highlighting some of the challenges and solutions we've encountered. We'll talk about the impact of implementation and design on scaling, including the effect of process concurrency on overall distributed system properties, pushing queries down to data, and controlling query fanout.

SPEAKER George Talbot, Google
02:00 PM - 02:30 PM
Observability, It's Bigger Than Production

Discussions about Observability and its uses are often rooted in the context of production services. While this makes sense given the target demographic of much of today’s Observability tooling, Observability is bigger than production, and until we expand the conversation, we’ll continue to miss opportunities to improve our organizations (and our lives!) because we simply don’t see them. Oncall health, capacity planning, client code, cloud costs, HR workflows, and even query patterns on your internal wiki are all important aspects of your business, bursting with information and opportunities if you have the ability to analyze them.

SPEAKER Gordon Radlein, Etsy
02:30 PM - 03:00 PM
Break
03:00 PM - 03:30 PM
Scribe Observability - Monitoring A Message Bus At Scale

Scribe is a flexible data transport system used widely across Facebook. It underpins many products, revenue streams, and other applications of critical importance to the company. Scribe runs on more than 3 million hosts and, at peak, ingests 2.5 TB/s and delivers 7 TB/s of data. Monitoring and operating a system at this scale is challenging, which is why we have had to build dedicated systems with the sole goal of solving observability for Scribe. This talk gives a deep dive into two such dedicated systems used to monitor Scribe’s vitals, and sheds light on the design trade-offs that were made when building them.

SPEAKER Dino Wernli, Facebook
SPEAKER Cristina Opriceana, Facebook
03:30 PM - 04:00 PM
Scaling Observability Data at Data Dog

As systems scale, the data collected to understand behavior grows in volume and complexity. Observability systems that empower insight have conflicting challenges. Writes have to be high throughput to support the immense volume, but queries must be low latency to be suitable for reactive operations. Then, while query traffic is repetitive in aggregate, the important queries, testing hypotheses in the crucible of an outage, are unique and unpredictable. This talk examines the difficulties in meeting these requirements under ever growing load, and the architectural tradeoffs we've employed to support this accelerating growth.

SPEAKER Jason Moiron, Data Dog
04:00 PM - 04:30 PM
Scuba - Real-time Monitoring And Log Analytics At Scale

Scuba is Facebook’s platform for realtime ingestion, processing, storing, and querying of structured logs from the entire fleet of machines. Scuba makes data available for querying in less than a minute and uses a massive fanout architecture to provide query results in less than a second. In this talk, we discuss some of the monitoring, debugging, and ad-hoc analytic use cases that are enabled by Scuba’s “log-everything” approach before we go into details about the system’s architecture and the set of tradeoffs we had to make to scale the platform. We also discuss some of our future plans and the central role we see Scuba playing in Facebook’s observability infrastructure.

SPEAKER Harani Mukkala, Facebook
SPEAKER Stavros Harizopoulos, Facebook
04:30 PM - 05:00 PM
Developing Meaningful SLIs for Fun and Profit

Developing meaningful SLIs is not an easy task, but your Error Budgets and your SLOs are only useful if they’re informed by good SLIs. In this talk we’ll be performing a deep dive on how to develop SLIs that actually reflect the journeys of your users. We’ll start by describing an example web service that should look familiar to you, identify the “low-hanging fruit” metrics you might be tempted to use, talk about the limitations of these low-hanging fruit, and conclude with some concrete examples of what useful SLIs for such a system might actually look like.

SPEAKER Alex Hidalgo, Squarespace
05:00 PM - 05:15 PM
Closing Remarks
SPEAKER Jeromy Carriere, Facebook
05:15 PM - 06:00 PM
Networking Happy Hour

SPEAKERS AND MODERATORS

Jeromy Carriere

Facebook

Ben Sigelman

LightStep

Helga Gudmundsdottir

Facebook

Yuri Shkuro

Uber

Spiros Xanthos

Omnition

Constance Caramanolis

Omnition

Dave Marchevsky

Facebook

Pranav Thulasiram Bhat

Instagram

George Talbot

Google

Gordon Radlein

Etsy

Dino Wernli

Facebook

Cristina Opriceana

Facebook

Jason Moiron

Data Dog

Harani Mukkala

Facebook

Stavros Harizopoulos

Facebook

Alex Hidalgo

Squarespace
