Event times below are displayed in PT.
Systems @Scale is an invitation-only technical conference for engineers who manage large-scale information systems serving millions of people.
As our systems continue to scale, the problem of understanding whether they're behaving as desired gets progressively harder. As a community, we have developed tools, techniques, and approaches for observing the state of these complex distributed systems with the goal of understanding system availability, reliability, performance, and efficiency.
We’ll spend the day covering a wide range of topics exploring these challenges and collaborating on the development of new solutions.
At Facebook, we care deeply about performance for reasons such as improving user experience, reducing environmental impact, and bringing down operational costs. Quickly and precisely diagnosing and pinpointing the root causes of regressions, as well as identifying optimization opportunities, are key challenges for engineers trying to achieve their performance goals. Sophisticated analysis and visualization tools allow us to gain insights and draw conclusions about collected performance and observability data. In this talk, we share experiences investigating performance regressions using two of our in-house performance analysis tools, CV and Tracery.
Our industry has embraced microservices as an architectural pattern, resulting in an exponential increase in the complexity of the distributed systems we operate. For example, Uber's backend consists of thousands of microservices, and their inter-dependencies are constantly changing. Distributed tracing has emerged as the go-to solution for understanding what's going on in these ever-changing architectures. Basic traces are a good start, but even they struggle with the complexity of modern systems, where a single "book a ride" request can generate several thousand trace events. Too much information! This talk starts with a refresher on distributed tracing as a core observability tool in modern systems, from the single-trace view to aggregate analysis. Then we show how Uber uses data mining, complexity reduction, and intuitive visualizations to bring real traces (not toy examples) back into the realm of human comprehension; guide the user toward actionable insights about the root cause of outages; and drastically reduce time to mitigation.
A single trace can reveal many things: network latencies, time spent in databases, a service spinning idly, etc., but finding the trace that demonstrates a problem in a large distributed application is very hard. By looking at traces in aggregate, we can eliminate the need to state and validate hypotheses; instead, answers start to emerge naturally. This talk will present the power of aggregate analysis of distributed traces by highlighting its applications beyond performance troubleshooting.
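To make the idea of aggregate analysis concrete, here is a minimal sketch of grouping span durations by service and operation and summarizing them, so latency hotspots surface without hand-picking individual traces. The span tuples and field names are hypothetical; real tracing systems carry much richer span models.

```python
from collections import defaultdict
from statistics import median

# Hypothetical span records: (service, operation, duration_ms).
spans = [
    ("rides", "book_ride", 120.0),
    ("rides", "book_ride", 95.0),
    ("rides", "book_ride", 480.0),
    ("geo", "lookup", 12.0),
    ("geo", "lookup", 15.0),
    ("geo", "lookup", 11.0),
]

def aggregate_latencies(spans):
    """Group span durations by (service, operation) and summarize them."""
    groups = defaultdict(list)
    for service, op, dur in spans:
        groups[(service, op)].append(dur)
    summary = {}
    for key, durs in groups.items():
        durs.sort()
        summary[key] = {
            "count": len(durs),
            "p50": median(durs),
            "max": durs[-1],  # the outlier a single-trace view might miss
        }
    return summary

for key, stats in aggregate_latencies(spans).items():
    print(key, stats)
```

Even in this toy form, the aggregate view shows that `book_ride` has a heavy tail (a 480 ms max against a 120 ms median) without anyone having to guess which trace to inspect.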
Instagram is growing quickly: every day, more developers add more features for more users. How does Instagram investigate efficiency bottlenecks? How are performance regressions in production caught, triaged, and resolved? How can we do all of this efficiently in production? In this talk, we give an overview of the profiling framework used to understand the production performance of Instagram's webserver, show how the data is processed for regression detection and general efficiency work, and walk through previous iterations of this system to understand changes and improvements.
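One common shape for the regression-detection step described above is to compare per-function sample shares between two profiling windows and flag functions whose share grew. This is a simplified sketch under assumed inputs (flat `{function: sample_count}` maps and a made-up threshold), not Instagram's actual pipeline.

```python
def detect_regressions(baseline, current, threshold=1.25):
    """Flag functions whose share of profiler samples grew past threshold.

    baseline/current: {function_name: sample_count} from a sampling
    profiler (hypothetical shape; real pipelines carry richer metadata).
    Using shares rather than raw counts normalizes for traffic changes.
    """
    base_total = sum(baseline.values()) or 1
    cur_total = sum(current.values()) or 1
    regressions = []
    for fn, cur_samples in current.items():
        cur_share = cur_samples / cur_total
        base_share = baseline.get(fn, 0) / base_total
        if base_share > 0 and cur_share / base_share >= threshold:
            regressions.append((fn, base_share, cur_share))
    # Largest absolute growth first, so triage starts with the worst offender.
    return sorted(regressions, key=lambda r: r[2] - r[1], reverse=True)

baseline = {"render_feed": 500, "serialize": 300, "auth": 200}
current = {"render_feed": 520, "serialize": 600, "auth": 210}
print(detect_regressions(baseline, current))
```

In this toy data, only `serialize` crosses the threshold: its share of samples rose from 30% to roughly 45% while the other functions held steady.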
This session discusses Google's planet-wide monitoring system, Monarch, highlighting some of the challenges and solutions we've encountered. We'll talk about the impact of implementation and design on scaling, including process concurrency on overall distributed system properties, pushing queries down to data, and controlling query fanout.
Discussions about Observability and its uses are often rooted in the context of production services. While this makes sense given the target demographic of much of today’s Observability tooling, Observability is bigger than production and until we expand the conversation, we’ll continue to miss opportunities to improve our organizations (and our lives!) because we simply don’t see them. Oncall health, capacity planning, client code, cloud costs, HR workflows, and even query patterns on your internal wiki are all important aspects of your business bursting with information and opportunities if you have the ability to analyze them.
Scribe is a flexible data transport system used widely across Facebook. It underpins many products, revenue streams, and other applications of critical importance to the company. Scribe runs on more than 3M hosts; at peak, it ingests 2.5TB/s and delivers 7TB/s of data. Monitoring and operating a system at this scale is challenging, which is why we have had to build dedicated systems with the sole goal of solving observability for Scribe. This talk gives a deep dive into two such dedicated systems used to monitor Scribe's vitals, and sheds light on the design trade-offs we made when building them.
As systems scale, the data collected to understand behavior grows in volume and complexity. Observability systems that empower insight have conflicting challenges. Writes have to be high throughput to support the immense volume, but queries must be low latency to be suitable for reactive operations. Then, while query traffic is repetitive in aggregate, the important queries, testing hypotheses in the crucible of an outage, are unique and unpredictable. This talk examines the difficulties in meeting these requirements under ever growing load, and the architectural tradeoffs we've employed to support this accelerating growth.
Scuba is Facebook’s platform for realtime ingestion, processing, storing, and querying of structured logs from the entire fleet of machines. Scuba makes data available for querying in less than a minute and uses a massive fanout architecture to provide query results in less than a second. In this talk, we discuss some of the monitoring, debugging, and ad-hoc analytic use cases that are enabled by Scuba’s “log-everything” approach before we go into details about the system’s architecture and the set of tradeoffs we had to make to scale the platform. We also discuss some of our future plans and the central role we see Scuba playing in Facebook’s observability infrastructure.
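The fanout pattern the abstract mentions can be sketched as a scatter-gather query: each leaf shard computes a partial aggregate over its slice of the logs, and the root merges the partials. The shard data, field names, and use of threads here are illustrative assumptions, not Scuba's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical leaf shards, each holding a slice of recent log rows.
SHARDS = [
    [{"endpoint": "/feed", "latency_ms": 120},
     {"endpoint": "/feed", "latency_ms": 80}],
    [{"endpoint": "/feed", "latency_ms": 200}],
    [{"endpoint": "/profile", "latency_ms": 40}],
]

def query_shard(rows, endpoint):
    """Leaf-level partial aggregation: (count, sum) for one shard."""
    hits = [r["latency_ms"] for r in rows if r["endpoint"] == endpoint]
    return len(hits), sum(hits)

def fanout_query(endpoint):
    """Fan the query out to every shard in parallel, then merge partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, endpoint),
                                 SHARDS))
    count = total = 0
    for c, s in partials:
        count += c
        total += s
    return {"count": count,
            "avg_latency_ms": total / count if count else None}

print(fanout_query("/feed"))
```

Because each shard returns only a small partial aggregate rather than raw rows, the merge cost at the root stays tiny even as the number of leaves grows, which is what makes sub-second results over a massive fleet plausible.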
Developing meaningful SLIs is not an easy task, but your Error Budgets and SLOs are only useful if they're informed by good SLIs. In this talk we'll take a deep dive into how to develop SLIs that actually reflect the journeys of your users. We'll start by describing an example web service that should look familiar to you, identify the "low-hanging fruit" metrics you might be tempted to use, discuss the limitations of those metrics, and conclude with some concrete examples of what useful SLIs for such a system might actually look like.
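One way to express the contrast between a low-hanging-fruit metric and a user-journey SLI is as a good-events-over-total-events ratio: a request only counts as "good" if it succeeded and was fast enough to feel responsive. The request shape and the 300 ms threshold below are illustrative assumptions.

```python
def availability_sli(requests, latency_slo_ms=300):
    """A user-journey SLI: the fraction of requests that were both
    successful AND fast enough, instead of a raw average latency
    (which can look healthy while many users see errors or slowness).

    requests: list of (status_code, latency_ms) tuples (hypothetical shape).
    """
    if not requests:
        return None
    good = sum(
        1 for status, latency in requests
        if status < 500 and latency <= latency_slo_ms
    )
    return good / len(requests)

requests = [(200, 120), (200, 450), (503, 90), (200, 200)]
# One slow success and one server error both count against the SLI.
print(availability_sli(requests))  # → 0.5
```

Note that the mean latency of this sample is only 215 ms, which a naive dashboard would call healthy, yet half the requests failed the user-journey definition of "good"; that gap is exactly why the low-hanging-fruit metric misleads.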