Event times below are displayed in PT.
Systems @Scale is a technical conference for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.
A lot of work happens behind the scenes that requires time and processing that should not block real-time actions in our products. Things like notifications, event invites, and video rendering may entail long waits or require a large amount of processing power. In this talk, we will detail how Facebook handles asynchronous processing at large scale and the challenges that come with maintaining the reliability of a large multitenant system that's constantly growing in demand.
Learn more about Async here: https://engineering.fb.com/production-engineering/async/
Driven by the growth of Facebook users and a richer product experience, the space of back-end sharded services has proliferated in the past decade. Since 2011, we have been pioneering the idea of abstracting out "sharding as a platform" to help stateful services scale easily.
In this talk, we present Shard Manager, a generic shard management platform, and share how it facilitates the development and operation of upper hundreds of diverse stateful sharded services totaling tens of millions of shards hosted on hundreds of thousands of servers. We will look at how Shard Manager is fully integrated in our infrastructure ecosystem and provides a holistic, end-to-end solution supporting not only basic shard failover but also sophisticated load balancing, shard scaling, and operational safety.
At Facebook, virtually all our infrastructure is powered in some fashion by Apache ZooKeeper. This includes service discovery, configuration management, package deployment, and cluster management. Every piece of our infrastructure must maintain commitments of consistency and durability in the face of machine failures, network partitions, and human error. More often than not, ZooKeeper is the low-dependency metadata storage service of choice.
The infrastructure that ZooKeeper powers comprises a ubiquitous cluster management platform, atop which the rest of Facebook’s software runs. This platform autonomously manages thousands of services across millions of machines, providing a huge degree of flexibility and leverage for engineers.
So when Facebook’s ZooKeeper team decided that they, too, wanted this flexibility and leverage, it meant turning our dependency graph on its head. In this talk, we will present the 18-month journey that brought hundreds of ZooKeeper ensembles in from the cold bare metal so they could safely run atop the cluster management platform that they make possible.
Electrical faults, issues during routine maintenance, or even incidents such as a snake crawling into our power infrastructure are common occurrences in our data centers. These events cause machine outages for a failure domain within a data center but can often escalate to a full data-center-level failure for a service. A large contributor to this escalation is that different types of hardware and services are concentrated in specific failure domains within a data center region, instead of being well spread across all failure domains. Losing one such failure domain means that we lose a large portion of a given service. And as our data center fleet continues to grow quickly over the next several years, we expect the number of these failures to grow as well.
In this talk, we will focus on how we have started to optimize the failure domain spread of our hardware, services, and data so that the loss of any failure domain within a data center region affects the smallest possible portion of any service or hardware type. This allows us to build subregion fault tolerance expectations across our systems, including buying enough buffers to lose a fault domain without losing the entire data center and without negative impact on our end users.
Facebook's web tier, consisting of thousands of servers, is one of the largest services in existence. The tier has heavy compute requirements at times of peak usage, but also shows a pronounced diurnal load pattern based on active users. Dynamic sizing of this tier allows a significant amount of capacity to be freed for other purposes during off-peak times.
In this talk, we will present Throughput Autoscaling, the horizontal autoscaling strategy used by Facebook's web tier to free off-peak capacity while maintaining strong safety guarantees.
We will compare Throughput Autoscaling with more common approaches to the horizontal autoscaling problem. We will explore some of the deficiencies of these more common approaches and how Throughput Autoscaling addresses them to keep Facebook’s web tier both safe and efficient throughout the day.
Chris is a Production Engineer from Meta's Core Systems team. Having lived at the bottom of Meta's infrastructure stack for...