Systems @Scale Remote Edition — Summer 2020

AUGUST 19, 2020 @ 11:00 AM PDT - 12:00 PM PDT

AUGUST 26, 2020 @ 11:00 AM PDT - 12:00 PM PDT

SEPTEMBER 02, 2020 @ 11:00 AM PDT - 12:00 PM PDT

SEPTEMBER 09, 2020 @ 11:00 AM PDT - 12:00 PM PDT

SEPTEMBER 16, 2020 @ 11:00 AM PDT - 12:00 PM PDT

Designed for engineers and technologists who specialize in, and are interested in, how information moves through products.

ABOUT EVENT

Systems @Scale is a technical conference for engineers who manage large-scale information systems serving millions of people. Operating large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on new solutions.

EVENT AGENDA

Event times below are displayed in PT.

August 19

11:00 AM - 12:00 PM

Asynchronous computing @Facebook: Driving efficiency and developer productivity at Facebook scale

Much of the work behind our products takes time and processing power and should not block real-time actions. Things like notifications, event invites, and video rendering may entail long waits or require a large amount of processing power. In this talk, we will detail how Facebook handles asynchronous processing at large scale and the challenges of maintaining the reliability of a large multitenant system whose demand is constantly growing.

Learn more about Async here: https://engineering.fb.com/production-engineering/async/

Speaker Carla Souza, Facebook

11:00 AM - 12:00 PM

Wednesday, August 26 — Scaling services with Shard Manager

Driven by the growth of Facebook users and a richer product experience, back-end sharded services have proliferated over the past decade. Since 2011, we have been pioneering the idea of abstracting out "sharding as a platform" to help stateful services scale easily.

In this talk, we present Shard Manager, a generic shard management platform, and share how it facilitates the development and operation of many hundreds of diverse stateful sharded services, totaling tens of millions of shards hosted on hundreds of thousands of servers. We will look at how Shard Manager is fully integrated into our infrastructure ecosystem and provides a holistic, end-to-end solution that supports not only basic shard failover but also sophisticated load balancing, shard scaling, and operational safety.
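As a rough illustration of what "basic shard failover" can mean, the sketch below moves every shard off a failed server onto whichever healthy server currently holds the fewest shards. This is not Shard Manager's algorithm, which the abstract does not describe; the function name `fail_over`, the server and shard names, and the least-loaded policy are all assumptions made for the example.

```python
def fail_over(assignment, failed_server):
    """Return a new assignment with the failed server's shards moved elsewhere.

    assignment maps a server name to the list of shards it hosts.
    Hypothetical policy: each orphaned shard goes to the healthy server
    with the fewest shards, with ties broken by server name.
    """
    # Copy the healthy servers so the caller's assignment is not mutated.
    healthy = {
        server: list(shards)
        for server, shards in assignment.items()
        if server != failed_server
    }
    if not healthy:
        raise RuntimeError("no healthy servers left to host shards")

    for shard in assignment.get(failed_server, []):
        target = min(healthy, key=lambda s: (len(healthy[s]), s))
        healthy[target].append(shard)
    return healthy


# Illustrative data only.
assignment = {
    "server-1": ["shard-a", "shard-b"],
    "server-2": ["shard-c"],
    "server-3": ["shard-d", "shard-e"],
}
after = fail_over(assignment, "server-3")
print(after)
```

A production system would also have to weigh load, replica placement, and operational safety, which this sketch deliberately ignores.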

Speaker Gerald Guo, Meta

Speaker Thawan Kooburat, Facebook

11:00 AM - 12:00 PM

Wednesday, September 2 — Containerizing Zookeeper: Powering container orchestration from within

At Facebook, virtually all our infrastructure is powered in some fashion by Apache ZooKeeper. This includes service discovery, configuration management, package deployment, and cluster management. Every piece of our infrastructure must maintain commitments of consistency and durability in the face of machine failures, network partitions, and human error. More often than not, ZooKeeper is the low-dependency metadata storage service of choice.

The infrastructure that ZooKeeper powers comprises a ubiquitous cluster management platform, atop which the rest of Facebook’s software runs. This platform autonomously manages thousands of services across millions of machines, providing a huge degree of flexibility and leverage for engineers.

So when Facebook’s ZooKeeper team decided that they, too, wanted this flexibility and leverage, it meant turning our dependency graph on its head. In this talk, we will present the 18-month journey that brought hundreds of ZooKeeper ensembles in from the cold bare metal so they could safely run atop the cluster management platform that they make possible.

Speaker Christopher Bunn, Meta

11:00 AM - 12:00 PM

Wednesday, September 9 — Fault tolerance through optimal workload placement

Electrical faults, issues during routine maintenance, or even incidents such as a snake crawling into our power infrastructure are common occurrences in our data centers. These events cause machine outages for a failure domain within a data center but can often escalate to a full data-center-level failure for a service. A large contributor to this escalation is that different types of hardware and services are concentrated in specific failure domains within a data center region, instead of being well spread across all failure domains. Losing one such failure domain means that we lose a large portion of a given service. And as our data center fleet continues to grow quickly over the next several years, we expect the number of these failures to grow as well.

In this talk, we will focus on how we have started to optimize the failure domain spread of our hardware, services, and data so that the loss of any failure domain within a data center region affects the smallest possible portion of any service or hardware type. This allows us to build subregion fault tolerance expectations across our systems, including buying enough buffer capacity to lose a fault domain without losing the entire data center and without negative impact on our end users.
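The goal of spreading a service across failure domains can be made concrete with a small sketch. It is an illustration only, not the placement system described in the talk: the function names, the domain labels (`fd-a`, `fd-b`, `fd-c`), and the round-robin policy are assumptions. The sketch compares the worst-case fraction of a service lost when one failure domain goes down, for a concentrated placement and an evenly spread one.

```python
def spread_replicas(num_replicas, domains):
    """Assign replicas to failure domains round-robin.

    Hypothetical policy: the resulting per-domain counts differ by at most one.
    """
    placement = {domain: 0 for domain in domains}
    for i in range(num_replicas):
        placement[domains[i % len(domains)]] += 1
    return placement


def worst_case_loss(placement):
    """Fraction of the service lost if its most heavily used failure domain fails."""
    total = sum(placement.values())
    return max(placement.values()) / total if total else 0.0


# Illustrative numbers only: 10 replicas of one service across 3 failure domains.
concentrated = {"fd-a": 8, "fd-b": 1, "fd-c": 1}
spread = spread_replicas(10, ["fd-a", "fd-b", "fd-c"])

print(worst_case_loss(concentrated))  # 0.8
print(worst_case_loss(spread))        # 0.4
```

With an even spread, the buffer needed to survive the loss of any one domain is bounded by the largest per-domain share rather than by most of the service.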

Speaker Elisa Shibley, Facebook

11:00 AM - 12:00 PM

Wednesday, September 16 — Throughput Autoscaling: Dynamic sizing for facebook.com

Facebook's web tier, consisting of thousands of servers, is one of the largest services in existence. The tier has substantial compute requirements at times of peak usage but also shows a pronounced diurnal load pattern driven by active users. Dynamic sizing of this tier frees a large amount of capacity for other purposes during off-peak times.

In this talk, we will present Throughput Autoscaling, the horizontal autoscaling strategy used by Facebook's web tier to free off-peak capacity while maintaining strong safety guarantees. We will compare Throughput Autoscaling with more common approaches to the horizontal autoscaling problem. We will explore some of the deficiencies of these more common approaches and how Throughput Autoscaling addresses them to keep Facebook’s web tier both safe and efficient throughout the day.

Speaker Daniel Boeve, Facebook

SPEAKERS AND MODERATORS

Carla Souza

Facebook

Gerald Guo

Gerald is a Research Scientist at Meta, and working on building a global service...

Thawan Kooburat

Facebook

Christopher Bunn

Chris is a Production Engineer from Meta's Core Systems team. Having lived at the...

Elisa Shibley

Facebook

Daniel Boeve

Facebook
