Event times below are displayed in PT.
Systems @Scale is an invitation-only technical conference for engineers who manage large-scale information systems serving millions of people.
As our systems continue to scale, the problem of understanding whether they're behaving as desired gets progressively harder. As a community, we have developed tools, techniques, and approaches for observing the state of these complex distributed systems with the goal of understanding system availability, reliability, performance, and efficiency.
We’ll spend the day covering a wide range of topics exploring these challenges and collaborating on the development of new solutions.
At Facebook, we care deeply about performance for reasons such as improving user experience, reducing environmental impact, and bringing down operational costs. Quickly and precisely diagnosing and pinpointing the root causes of regressions, as well as identifying optimization opportunities, are key challenges for engineers trying to achieve their performance goals. Sophisticated analysis and visualization tools allow us to gain insights and draw conclusions about collected performance and observability data. In this talk, we share experiences investigating performance regressions using two of our in-house performance analysis tools, CV and Tracery.
Our industry has embraced microservices as an architectural pattern, resulting in an exponential increase in the complexity of the distributed systems we operate. For example, Uber's backend consists of thousands of microservices, and their inter-dependencies are constantly changing. Distributed tracing has emerged as the go-to solution for understanding what's going on in these ever-changing architectures. Basic traces are a good start, but even they struggle with the complexity of modern systems, where a single "book a ride" request can generate several thousand trace events. Too much information! This talk starts with a refresher on distributed tracing as a core observability tool in modern systems, from the single-trace view to aggregate analysis. Then we show how Uber uses data mining, complexity reduction, and intuitive visualizations to bring real traces (not toy examples) back into the realm of human comprehension; guide the user toward actionable insights about the root cause of outages; and drastically reduce time to mitigation.
A single trace can reveal many things: network latencies, time spent in databases, a service spinning idly, etc., but finding the trace that demonstrates a problem in a large distributed application is very hard. By looking at traces in aggregate, we can eliminate the need to state and validate hypotheses; instead, answers start to emerge naturally. This talk will present the power of aggregate analysis of distributed traces by highlighting its applications beyond performance troubleshooting.
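To make the idea of aggregate analysis concrete, here is a minimal sketch of grouping span durations by service and operation and summarizing them, so latency hotspots surface without hand-picking individual traces. The span tuples and field names are hypothetical; real tracing systems carry much richer span models.

```python
from collections import defaultdict
from statistics import median

# Hypothetical span records: (service, operation, duration_ms).
spans = [
    ("rides", "book_ride", 120.0),
    ("rides", "book_ride", 95.0),
    ("rides", "book_ride", 480.0),
    ("geo", "lookup", 12.0),
    ("geo", "lookup", 15.0),
    ("geo", "lookup", 11.0),
]

def aggregate_latencies(spans):
    """Group span durations by (service, operation) and summarize them."""
    groups = defaultdict(list)
    for service, op, dur in spans:
        groups[(service, op)].append(dur)
    summary = {}
    for key, durs in groups.items():
        durs.sort()
        summary[key] = {
            "count": len(durs),
            "p50": median(durs),
            "max": durs[-1],  # the outlier a single-trace view might miss
        }
    return summary

for key, stats in aggregate_latencies(spans).items():
    print(key, stats)
```

Even in this toy form, the aggregate view shows that `book_ride` has a heavy tail (a 480 ms max against a 120 ms median) without anyone having to guess which trace to inspect.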
Instagram is growing quickly: every day, more developers add more features for more users. How does Instagram investigate efficiency bottlenecks? How are performance regressions in production caught, triaged, and resolved? How can we do all of this efficiently in production? In this talk, we give an overview of the profiling framework used to understand the production performance of Instagram's webserver, show how the data is processed for regression detection and general efficiency work, and walk through previous iterations of this system to understand changes and improvements.
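One common shape for the regression-detection step described above is to compare per-function sample shares between two profiling windows and flag functions whose share grew. This is a simplified sketch under assumed inputs (flat `{function: sample_count}` maps and a made-up threshold), not Instagram's actual pipeline.

```python
def detect_regressions(baseline, current, threshold=1.25):
    """Flag functions whose share of profiler samples grew past threshold.

    baseline/current: {function_name: sample_count} from a sampling
    profiler (hypothetical shape; real pipelines carry richer metadata).
    Using shares rather than raw counts normalizes for traffic changes.
    """
    base_total = sum(baseline.values()) or 1
    cur_total = sum(current.values()) or 1
    regressions = []
    for fn, cur_samples in current.items():
        cur_share = cur_samples / cur_total
        base_share = baseline.get(fn, 0) / base_total
        if base_share > 0 and cur_share / base_share >= threshold:
            regressions.append((fn, base_share, cur_share))
    # Largest absolute growth first, so triage starts with the worst offender.
    return sorted(regressions, key=lambda r: r[2] - r[1], reverse=True)

baseline = {"render_feed": 500, "serialize": 300, "auth": 200}
current = {"render_feed": 520, "serialize": 600, "auth": 210}
print(detect_regressions(baseline, current))
```

In this toy data, only `serialize` crosses the threshold: its share of samples rose from 30% to roughly 45% while the other functions held steady.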
This session discusses Google's planet-wide monitoring system, Monarch, highlighting some of the challenges and solutions we've encountered. We'll talk about the impact of implementation and design on scaling, including process concurrency on overall distributed system properties, pushing queries down to data, and controlling query fanout.
Discussions about Observability and its uses are often rooted in the context of production services. While this makes sense given the target demographic of much of today’s Observability tooling, Observability is bigger than production and until we expand the conversation, we’ll continue to miss opportunities to improve our organizations (and our lives!) because we simply don’t see them. Oncall health, capacity planning, client code, cloud costs, HR workflows, and even query patterns on your internal wiki are all important aspects of your business bursting with information and opportunities if you have the ability to analyze them.
Scribe is a flexible data transport system used widely across Facebook. It underpins many products, revenue streams, and other applications of critical importance to the company. Scribe runs on more than 3M hosts; at peak, it ingests 2.5TB/s and delivers 7TB/s of data. Monitoring and operating a system at this scale is challenging, which is why we have had to build dedicated systems with the sole goal of solving observability for Scribe. This talk gives a deep dive into two such dedicated systems used to monitor Scribe's vitals, and sheds light on the design trade-offs we made when building them.
As systems scale, the data collected to understand behavior grows in volume and complexity. Observability systems that empower insight have conflicting challenges. Writes have to be high throughput to support the immense volume, but queries must be low latency to be suitable for reactive operations. Then, while query traffic is repetitive in aggregate, the important queries, testing hypotheses in the crucible of an outage, are unique and unpredictable. This talk examines the difficulties in meeting these requirements under ever growing load, and the architectural tradeoffs we've employed to support this accelerating growth.
Scuba is Facebook’s platform for realtime ingestion, processing, storing, and querying of structured logs from the entire fleet of machines. Scuba makes data available for querying in less than a minute and uses a massive fanout architecture to provide query results in less than a second. In this talk, we discuss some of the monitoring, debugging, and ad-hoc analytic use cases that are enabled by Scuba’s “log-everything” approach before we go into details about the system’s architecture and the set of tradeoffs we had to make to scale the platform. We also discuss some of our future plans and the central role we see Scuba playing in Facebook’s observability infrastructure.
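The fanout pattern the abstract mentions can be sketched as a scatter-gather query: each leaf shard computes a partial aggregate over its slice of the logs, and the root merges the partials. The shard data, field names, and use of threads here are illustrative assumptions, not Scuba's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical leaf shards, each holding a slice of recent log rows.
SHARDS = [
    [{"endpoint": "/feed", "latency_ms": 120},
     {"endpoint": "/feed", "latency_ms": 80}],
    [{"endpoint": "/feed", "latency_ms": 200}],
    [{"endpoint": "/profile", "latency_ms": 40}],
]

def query_shard(rows, endpoint):
    """Leaf-level partial aggregation: (count, sum) for one shard."""
    hits = [r["latency_ms"] for r in rows if r["endpoint"] == endpoint]
    return len(hits), sum(hits)

def fanout_query(endpoint):
    """Fan the query out to every shard in parallel, then merge partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, endpoint),
                                 SHARDS))
    count = total = 0
    for c, s in partials:
        count += c
        total += s
    return {"count": count,
            "avg_latency_ms": total / count if count else None}

print(fanout_query("/feed"))
```

Because each shard returns only a small partial aggregate rather than raw rows, the merge cost at the root stays tiny even as the number of leaves grows, which is what makes sub-second results over a massive fleet plausible.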
Developing meaningful SLIs is not an easy task, but your Error Budgets and SLOs are only useful if they're informed by good SLIs. In this talk we'll take a deep dive into how to develop SLIs that actually reflect the journeys of your users. We'll start by describing an example web service that should look familiar to you, identify the "low-hanging fruit" metrics you might be tempted to use, discuss the limitations of those metrics, and conclude with some concrete examples of what useful SLIs for such a system might actually look like.
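One way to express the contrast between a low-hanging-fruit metric and a user-journey SLI is as a good-events-over-total-events ratio: a request only counts as "good" if it succeeded and was fast enough to feel responsive. The request shape and the 300 ms threshold below are illustrative assumptions.

```python
def availability_sli(requests, latency_slo_ms=300):
    """A user-journey SLI: the fraction of requests that were both
    successful AND fast enough, instead of a raw average latency
    (which can look healthy while many users see errors or slowness).

    requests: list of (status_code, latency_ms) tuples (hypothetical shape).
    """
    if not requests:
        return None
    good = sum(
        1 for status, latency in requests
        if status < 500 and latency <= latency_slo_ms
    )
    return good / len(requests)

requests = [(200, 120), (200, 450), (503, 90), (200, 200)]
# One slow success and one server error both count against the SLI.
print(availability_sli(requests))  # → 0.5
```

Note that the mean latency of this sample is only 215 ms, which a naive dashboard would call healthy, yet half the requests failed the user-journey definition of "good"; that gap is exactly why the low-hanging-fruit metric misleads.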