Systems @Scale Fall 2019

SEPTEMBER 18, 2019 @ 8:30 AM PDT - 6:00 PM PDT

Designed for engineers that manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.

ABOUT EVENT

Systems @Scale is an invitation-only technical conference for engineers that manage large-scale information systems serving millions of people.

As our systems continue to scale, the problem of understanding if they’re behaving as desired gets progressively harder. As a community, we have developed tools, techniques, and approaches that can be applied to observing the state of these complex distributed systems with the goal of understanding system availability, reliability, performance, and efficiency.

We’ll spend the day covering a wide range of topics exploring these challenges and collaborating on the development of new solutions.

EVENT AGENDA

Event times below are displayed in PT.

September 18

08:30 AM - 10:00 AM

Registration & Breakfast

10:00 AM - 10:05 AM

Welcome to Systems @Scale

Speaker Jeromy Carriere, Facebook

10:05 AM - 10:30 AM

Keynote: How “Deep Systems” Broke Observability...And What We Can Do About It

Speaker Ben Sigelman, LightStep

10:30 AM - 11:00 AM

A Tale of Two Performance Analysis Tools

At Facebook, we care deeply about performance for reasons such as improving user experience, reducing environmental impact, and bringing down operational costs. Quickly and precisely pinpointing the root causes of regressions and identifying optimization opportunities are key challenges for engineers trying to achieve their performance goals. Sophisticated analysis and visualization tools allow us to gain insights and draw conclusions from collected performance and observability data. In this talk, we share experiences investigating performance regressions using two of our in-house performance analysis tools, CV and Tracery.

Speaker Helga Gudmundsdottir, Facebook

11:00 AM - 11:30 AM

Comprehending Incomprehensible Architecture

Our industry has embraced microservices as an architectural pattern, resulting in an exponential increase in the complexity of the distributed systems we operate. For example, Uber's backend consists of thousands of microservices, and their inter-dependencies are constantly changing. Distributed tracing has emerged as the go-to solution for understanding what’s going on in these ever-changing architectures. Basic traces are a good start, but even they struggle to deal with the complexity of modern systems, where a single “book a ride” request can generate several thousand trace events. Too much information! This talk starts with a refresher on distributed tracing as a core observability tool in modern systems, from single trace view to aggregate analysis. Then we show how Uber uses data mining, complexity reduction, and intuitive visualizations to bring real traces (not toy examples) back into the realm of human comprehension, guide the user towards actionable insights about the root cause of outages, and drastically reduce time to mitigation.

Speaker Yuri Shkuro, Uber

11:30 AM - 12:00 PM

A Picture Is Worth 1,000 Traces

A single trace can reveal many things: network latencies, time spent in databases, a service spinning idly, etc., but finding the trace that demonstrates a problem in a large distributed application is very hard. By looking at traces in aggregate, we can eliminate the need to state and validate hypotheses; instead, answers start to emerge naturally. This talk will present the power of aggregate analysis of distributed traces by highlighting its applications beyond performance troubleshooting.

Speaker Spiros Xanthos, Omnition

Speaker Constance Caramanolis, Omnition

12:00 PM - 01:00 PM

Lunch

01:00 PM - 01:30 PM

Service Efficiency at Instagram Scale

Instagram is growing quickly - every day more developers add more features for more users. How does Instagram investigate efficiency bottlenecks? How are performance regressions in production caught, triaged, and resolved? How can we do this all efficiently in production? In this talk, we give an overview of the profiling framework used to understand the production performance of Instagram’s webserver, how the data is processed for regression detection and general efficiency work, and walk through previous iterations of this system to understand changes and improvements.

Speaker Dave Marchevsky, Facebook

Speaker Pranav Thulasiram Bhat, Instagram

01:30 PM - 02:00 PM

Monarch, Google’s Planet-Scale Monitoring Infrastructure

This session discusses Google's planet-wide monitoring system, Monarch, highlighting some of the challenges and solutions we've encountered. We'll talk about the impact of implementation and design on scaling, including the effect of process concurrency on overall distributed system properties, pushing queries down to data, and controlling query fanout.

Speaker George Talbot, Google

02:00 PM - 02:30 PM

Observability, It's Bigger Than Production

Discussions about Observability and its uses are often rooted in the context of production services. While this makes sense given the target demographic of much of today’s Observability tooling, Observability is bigger than production and until we expand the conversation, we’ll continue to miss opportunities to improve our organizations (and our lives!) because we simply don’t see them. Oncall health, capacity planning, client code, cloud costs, HR workflows, and even query patterns on your internal wiki are all important aspects of your business bursting with information and opportunities if you have the ability to analyze them.

Speaker Gordon Radlein, Etsy

02:30 PM - 03:00 PM

Break

03:00 PM - 03:30 PM

Scribe Observability - Monitoring A Message Bus At Scale

Scribe is a flexible data transport system used widely across Facebook. It underpins many products, revenue streams, and other applications of critical importance to the company. Scribe runs on more than 3 million hosts and, at peak, ingests 2.5 TB/s and delivers 7 TB/s of data. Monitoring and operating a system at this scale is challenging, which is why we have had to build dedicated systems with the sole goal of solving observability for Scribe. This talk gives a deep dive into two such dedicated systems used to monitor Scribe’s vitals, and sheds light on the design trade-offs that were made when building them.

Speaker Dino Wernli, Facebook

Speaker Cristina Opriceana, Facebook

03:30 PM - 04:00 PM

Scaling Observability Data at Datadog

As systems scale, the data collected to understand behavior grows in volume and complexity. Observability systems that empower insight face conflicting challenges. Writes have to be high throughput to support the immense volume, but queries must be low latency to be suitable for reactive operations. And while query traffic is repetitive in aggregate, the important queries, testing hypotheses in the crucible of an outage, are unique and unpredictable. This talk examines the difficulties in meeting these requirements under ever-growing load, and the architectural tradeoffs we've employed to support this accelerating growth.

Speaker Jason Moiron, Datadog

04:00 PM - 04:30 PM

Scuba - Real-time Monitoring And Log Analytics At Scale

Scuba is Facebook’s platform for real-time ingestion, processing, storing, and querying of structured logs from the entire fleet of machines. Scuba makes data available for querying in less than a minute and uses a massive fanout architecture to provide query results in less than a second. In this talk, we discuss some of the monitoring, debugging, and ad-hoc analytic use cases that are enabled by Scuba’s “log-everything” approach before we go into details about the system’s architecture and the set of tradeoffs we had to make to scale the platform. We also discuss some of our future plans and the central role we see Scuba playing in Facebook’s observability infrastructure.

Speaker Harani Mukkala, Facebook

Speaker Stavros Harizopoulos, Facebook

04:30 PM - 05:00 PM

Developing Meaningful SLIs for Fun and Profit

Developing meaningful SLIs is not an easy task, but your Error Budgets and your SLOs are only useful if they’re informed by good SLIs. In this talk we’ll be performing a deep dive on how to develop SLIs that actually reflect the journeys of your users. We’ll start by describing an example web service that should look familiar to you, identify the “low-hanging fruit” metrics you might be tempted to use, talk about the limitations of these low-hanging fruit, and conclude with some concrete examples of what useful SLIs for such a system might actually look like.

Speaker Alex Hidalgo, Squarespace

05:00 PM - 05:15 PM

Closing Remarks

Speaker Jeromy Carriere, Facebook

05:15 PM - 06:00 PM

Networking Happy Hour

SPEAKERS AND MODERATORS

Jeromy Carriere

Facebook

Ben Sigelman

LightStep

Helga Gudmundsdottir

Facebook

Yuri Shkuro

Uber

Spiros Xanthos

Omnition

Constance Caramanolis

Omnition

Dave Marchevsky

Facebook

Pranav Thulasiram Bhat

Instagram

George Talbot

Google

Gordon Radlein

Etsy

Dino Wernli

Facebook

Cristina Opriceana

Facebook

Jason Moiron

Datadog

Harani Mukkala

Facebook

Stavros Harizopoulos

Facebook

Alex Hidalgo

Squarespace
