
Systems @Scale Fall 2019

SEPTEMBER 18, 2019 @ 8:30 AM PDT - 6:00 PM PDT
Designed for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.


Systems @Scale is an invitation-only technical conference for engineers who manage large-scale information systems serving millions of people.

As our systems continue to scale, the problem of understanding whether they’re behaving as desired gets progressively harder. As a community, we have developed tools, techniques, and approaches for observing the state of these complex distributed systems, with the goal of understanding system availability, reliability, performance, and efficiency.

We’ll spend the day covering a wide range of topics exploring these challenges and collaborating on the development of new solutions.


Event times below are displayed in PT.

September 18

08:30 AM - 10:00 AM
Registration & Breakfast
10:00 AM - 10:05 AM
Welcome to Systems @Scale
Speaker Jeromy Carriere, Facebook
10:05 AM - 10:30 AM
Keynote: How “Deep Systems” Broke Observability...And What We Can Do About It
Speaker Ben Sigelman, LightStep
10:30 AM - 11:00 AM
A Tale of Two Performance Analysis Tools

At Facebook, we care deeply about performance for reasons such as improving user experience, reducing environmental impact, and bringing down operational costs. Quickly and precisely diagnosing the root causes of regressions, as well as identifying optimization opportunities, are key challenges for engineers who are trying to achieve their performance goals. Sophisticated analysis and visualization tools allow us to gain insights and draw conclusions from collected performance and observability data. In this talk, we share experiences investigating performance regressions using two of our in-house performance analysis tools, CV and Tracery.

Speaker Helga Gudmundsdottir, Facebook
11:00 AM - 11:30 AM
Comprehending Incomprehensible Architecture

Our industry has embraced microservices as an architectural pattern, resulting in an exponential increase in the complexity of the distributed systems we operate. For example, Uber's backend consists of thousands of microservices, and their inter-dependencies are constantly changing. Distributed tracing has emerged as the go-to solution for understanding what’s going on in these ever-changing architectures. Basic traces are a good start, but even they struggle to deal with the complexity of modern systems, where a single “book a ride” request can generate several thousand trace events. Too much information! This talk starts with a refresher on distributed tracing as a core observability tool in modern systems, from single-trace view to aggregate analysis. Then we show how Uber uses data mining, complexity reduction, and intuitive visualizations to bring real traces (not toy examples) back into the realm of human comprehension, guide the user towards actionable insights about the root cause of outages, and drastically reduce time to mitigation.

Speaker Yuri Shkuro, Uber
11:30 AM - 12:00 PM
A Picture Is Worth 1,000 Traces

A single trace can reveal many things: network latencies, time spent in databases, a service spinning idly, and so on. But finding the trace that demonstrates a problem in a large distributed application is very hard. By looking at traces in aggregate, we can eliminate the need to state and validate hypotheses; instead, answers start to emerge naturally. This talk will present the power of aggregate analysis of distributed traces by highlighting its applications beyond performance troubleshooting.
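
As a rough illustration of what looking at traces "in aggregate" can mean, the sketch below groups spans from several traces by service and summarizes their durations. It is not the tooling presented in the talk: the span tuples, the field layout, and the `aggregate_by_service` helper are all hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict
from statistics import median

# Hypothetical span records: (trace_id, service, duration_ms).
# Real tracing systems carry far more fields; this layout is an assumption.
SPANS = [
    ("t1", "api", 120.0),
    ("t1", "db", 80.0),
    ("t2", "api", 100.0),
    ("t2", "db", 30.0),
    ("t3", "api", 400.0),
    ("t3", "db", 350.0),
]


def aggregate_by_service(spans):
    """Group span durations by service and summarize each group."""
    durations = defaultdict(list)
    for _trace_id, service, duration_ms in spans:
        durations[service].append(duration_ms)
    return {
        service: {
            "count": len(values),
            "median_ms": median(values),
            "max_ms": max(values),
        }
        for service, values in durations.items()
    }


summary = aggregate_by_service(SPANS)
for service, stats in sorted(summary.items()):
    print(service, stats)
```

In this toy data set, no single trace has to be picked by hand: the gap between the median and the maximum for the `db` service already points at the slow trace.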

Speaker Spiros Xanthos, Omnition
Speaker Constance Caramanolis, Omnition
12:00 PM - 01:00 PM
01:00 PM - 01:30 PM
Service Efficiency at Instagram Scale

Instagram is growing quickly - every day more developers add more features for more users. How does Instagram investigate efficiency bottlenecks? How are performance regressions in production caught, triaged, and resolved? How can we do this all efficiently in production? In this talk, we give an overview of the profiling framework used to understand the production performance of Instagram’s webserver, how the data is processed for regression detection and general efficiency work, and walk through previous iterations of this system to understand changes and improvements.

Speaker Dave Marchevsky, Facebook
Speaker Pranav Thulasiram Bhat, Instagram
01:30 PM - 02:00 PM
Monarch, Google’s Planet-Scale Monitoring Infrastructure

This session discusses Google's planet-wide monitoring system, Monarch, highlighting some of the challenges and solutions we've encountered. We'll talk about the impact of implementation and design on scaling, including the effect of process concurrency on overall distributed system properties, pushing queries down to data, and controlling query fanout.

Speaker George Talbot, Google
02:00 PM - 02:30 PM
Observability, It's Bigger Than Production

Discussions about Observability and its uses are often rooted in the context of production services. While this makes sense given the target demographic of much of today’s Observability tooling, Observability is bigger than production, and until we expand the conversation, we’ll continue to miss opportunities to improve our organizations (and our lives!) because we simply don’t see them. On-call health, capacity planning, client code, cloud costs, HR workflows, and even query patterns on your internal wiki are all important aspects of your business, bursting with information and opportunities if you have the ability to analyze them.

Speaker Gordon Radlein, Etsy
02:30 PM - 03:00 PM
03:00 PM - 03:30 PM
Scribe Observability - Monitoring A Message Bus At Scale

Scribe is a flexible data transport system used widely across Facebook. It underpins many products, revenue streams, and other applications of critical importance to the company. Scribe runs on more than 3 million hosts and, at peak, ingests 2.5 TB/s and delivers 7 TB/s of data. Monitoring and operating a system at this scale is challenging, which is why we have had to build dedicated systems with the sole goal of solving observability for Scribe. This talk gives a deep dive into two such dedicated systems used to monitor Scribe’s vitals, and sheds light on the design trade-offs that were made when building them.

Speaker Dino Wernli, Facebook
Speaker Cristina Opriceana, Facebook
03:30 PM - 04:00 PM
Scaling Observability Data at Datadog

As systems scale, the data collected to understand behavior grows in volume and complexity. Observability systems that empower insight face conflicting challenges. Writes have to be high throughput to support the immense volume, but queries must be low latency to be suitable for reactive operations. And while query traffic is repetitive in aggregate, the important queries, testing hypotheses in the crucible of an outage, are unique and unpredictable. This talk examines the difficulties in meeting these requirements under ever-growing load, and the architectural tradeoffs we've employed to support this accelerating growth.

Speaker Jason Moiron, Datadog
04:00 PM - 04:30 PM
Scuba - Real-time Monitoring And Log Analytics At Scale

Scuba is Facebook’s platform for real-time ingestion, processing, storing, and querying of structured logs from the entire fleet of machines. Scuba makes data available for querying in less than a minute and uses a massive fanout architecture to provide query results in less than a second. In this talk, we discuss some of the monitoring, debugging, and ad-hoc analytic use cases that are enabled by Scuba’s “log-everything” approach before we go into details about the system’s architecture and the set of tradeoffs we had to make to scale the platform. We also discuss some of our future plans and the central role we see Scuba playing in Facebook’s observability infrastructure.

Speaker Harani Mukkala, Facebook
Speaker Stavros Harizopoulos, Facebook
04:30 PM - 05:00 PM
Developing Meaningful SLIs for Fun and Profit

Developing meaningful SLIs is not an easy task, but your Error Budgets and your SLOs are only useful if they’re informed by good SLIs. In this talk, we’ll take a deep dive into how to develop SLIs that actually reflect the journeys of your users. We’ll start by describing an example web service that should look familiar to you, identify the “low-hanging fruit” metrics you might be tempted to use, talk about the limitations of those metrics, and conclude with some concrete examples of what useful SLIs for such a system might actually look like.
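
To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal sketch. It is not taken from the talk: the request records, the 500 ms latency threshold, the definition of a "good" request, and the function names are all assumptions made for illustration.

```python
# Hypothetical request log: (http_status, latency_ms).
REQUESTS = [(200, 120), (200, 950), (500, 40), (200, 80)]


def is_good(status, latency_ms, threshold_ms=500):
    """Assumed definition: a request is good if it did not fail server-side
    and finished within the latency threshold."""
    return status < 500 and latency_ms <= threshold_ms


def request_sli(requests, threshold_ms=500):
    """SLI as the fraction of good requests; 1.0 when there is no traffic."""
    if not requests:
        return 1.0
    good = sum(1 for status, latency in requests if is_good(status, latency, threshold_ms))
    return good / len(requests)


def error_budget_remaining(sli, slo_target):
    """Share of the error budget still unspent; negative once the SLO is missed."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    return 1.0 - (1.0 - sli) / allowed_failure


sli = request_sli(REQUESTS)
print(f"SLI: {sli:.3f}")
print(f"Budget remaining at a 90% SLO: {error_budget_remaining(sli, 0.90):.2f}")
```

Counting slow responses as bad, and not only server errors, is one way to move from a "low-hanging fruit" metric towards something closer to what a user actually experiences; the right threshold depends on the service.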

Speaker Alex Hidalgo, Squarespace
05:00 PM - 05:15 PM
Closing Remarks
Speaker Jeromy Carriere, Facebook
05:15 PM - 06:00 PM
Networking Happy Hour



