TOPIC: Data, Systems and Networking

Systems @Scale Remote Edition — Summer 2020

AUGUST 19, 2020 @ 11:00 AM PDT - 12:00 PM PDT
AUGUST 26, 2020 @ 11:00 AM PDT - 12:00 PM PDT
SEPTEMBER 02, 2020 @ 11:00 AM PDT - 12:00 PM PDT
SEPTEMBER 09, 2020 @ 11:00 AM PDT - 12:00 PM PDT
SEPTEMBER 16, 2020 @ 11:00 AM PDT - 12:00 PM PDT

ABOUT EVENT

Systems @Scale is a technical conference for engineers who manage large-scale information systems serving millions of people. Operating such systems often introduces complex, unprecedented engineering challenges. The @Scale community brings people together to discuss these challenges and collaborate on developing new solutions.

EVENT AGENDA

Event times below are displayed in PT.

Wednesday, August 19

11:00 AM - 12:00 PM
Asynchronous computing @Facebook: Driving efficiency and developer productivity at Facebook scale

Much of the work behind our products takes time and processing power and should not block real-time actions. Tasks such as notifications, event invites, and video rendering may entail long waits or require a large amount of processing power. In this talk, we will detail how Facebook handles asynchronous processing at large scale and the challenges of maintaining the reliability of a large multitenant system whose demand is constantly growing.

Learn more about Async here: https://engineering.fb.com/production-engineering/async/

Speaker: Carla Souza, Facebook
11:00 AM - 12:00 PM
Wednesday, August 26 — Scaling services with Shard Manager

Driven by growth in Facebook users and a richer product experience, back-end sharded services have proliferated over the past decade. Since 2011, we have been pioneering the idea of abstracting out "sharding as a platform" to help stateful services scale easily.

In this talk, we present Shard Manager, a generic shard management platform, and share how it facilitates the development and operation of diverse stateful sharded services numbering in the high hundreds, totaling tens of millions of shards hosted on hundreds of thousands of servers. We will look at how Shard Manager is fully integrated into our infrastructure ecosystem and provides a holistic, end-to-end solution supporting not only basic shard failover but also sophisticated load balancing, shard scaling, and operational safety.

Speaker: Gerald Guo, Facebook
Speaker: Thawan Kooburat, Facebook
11:00 AM - 12:00 PM
Wednesday, September 2 — Containerizing Zookeeper: Powering container orchestration from within

At Facebook, virtually all our infrastructure is powered in some fashion by Apache Zookeeper. This includes service discovery, configuration management, package deployment, and cluster management. Every piece of our infrastructure must maintain commitments of consistency and durability in the face of machine failures, network partitions, and human error. More often than not, Zookeeper is the low-dependency metadata storage service of choice.

The infrastructure that Zookeeper powers comprises a ubiquitous cluster management platform, atop which the rest of Facebook’s software runs. This platform autonomously manages thousands of services across millions of machines, providing a huge degree of flexibility and leverage for engineers.

So when Facebook’s Zookeeper team decided that they, too, wanted this flexibility and leverage, it meant turning our dependency graph on its head. In this talk, we will present the 18-month journey that brought hundreds of Zookeeper ensembles in from the cold bare metal so they could safely run atop the cluster management platform that they make possible.

Speaker: Christopher Bunn, Meta
11:00 AM - 12:00 PM
Wednesday, September 9 — Fault tolerance through optimal workload placement

Electrical faults, issues during routine maintenance, or even incidents such as a snake crawling into our power infrastructure are common occurrences in our data centers. These events cause machine outages for a failure domain within a data center but can often escalate to a full data-center-level failure for a service. A large contributor to this escalation is that different types of hardware and services are concentrated in specific failure domains within a data center region, instead of being well spread across all failure domains. Losing one such failure domain means that we lose a large portion of a given service. And as our data center fleet continues to grow quickly over the next several years, we expect the number of these failures to grow as well.

In this talk, we will focus on how we have started to optimize for the failure domain spread of our hardware, services, and data to ensure that the loss of any failure domain within a data center region affects the smallest possible portion of any service or hardware type. This allows us to build subregion fault tolerance expectations across our systems, including buying enough buffer capacity to lose a fault domain without losing the entire data center and without negative impact on our end users.

Speaker: Elisa Shibley, Facebook
11:00 AM - 12:00 PM
Wednesday, September 16 — Throughput Autoscaling: Dynamic sizing for facebook.com

Facebook's web tier, consisting of thousands of servers, is one of the largest services in existence. The tier has heavy compute requirements at times of peak usage but also shows a pronounced diurnal load pattern that follows the number of active users. Dynamic sizing of this tier frees a substantial amount of capacity for other purposes during off-peak times.

In this talk, we will present Throughput Autoscaling, the horizontal autoscaling strategy used by Facebook's web tier to free off-peak capacity while maintaining strong safety guarantees. We will compare Throughput Autoscaling with more common approaches to the horizontal autoscaling problem. We will explore some of the deficiencies of these more common approaches and how Throughput Autoscaling addresses them to keep Facebook’s web tier both safe and efficient throughout the day.

Speaker: Daniel Boeve, Facebook

SPEAKERS AND MODERATORS

Carla Souza, Facebook

Gerald Guo, Facebook

Thawan Kooburat, Facebook

Christopher Bunn, Meta
Chris is a Production Engineer from Meta's Core Systems team. Having lived at the...

Elisa Shibley, Facebook

Daniel Boeve, Facebook
