Systems @Scale Remote Edition — Summer 2020

Virtual 11:00am - 12:00pm

Event Completed

@Scale brings thousands of engineers together throughout the year to discuss complex engineering challenges and to work on the development of new solutions. We're committed to providing a safe and welcoming environment — one that encourages collaboration and sparks innovation.

Every @Scale event participant has the right to enjoy his or her experience without fear of harassment, discrimination, or condescension. The @Scale code of conduct outlines the behavior that we support and don't support at @Scale events and conferences. We expect participants to follow these rules at all @Scale event venues, online communities, and event-related social activities. These guidelines will keep the @Scale community a safe and enjoyable one for everyone.

Be welcoming. Everyone is welcome at @Scale events, inclusive of (but not limited to) gender, gender identity or expression, sexual orientation, body size, differing abilities, ethnicity, national origin, language, religion, political beliefs, socioeconomic status, age, color and neurodiversity. We have a zero-tolerance policy for discrimination.

Choose your words carefully. Treat one another with respect and in a professional manner. We're here to collaborate. Conflict is not part of the equation.

Know where the line is, and don't cross it. Harassment, threats, or intimidation of any kind will not be tolerated. This includes verbal, physical, sexual (such as sexualized imagery on clothing, presentations, in print, or onscreen), written, or any other form of aggression (whether outright, subtle, or micro). Behavior that is offensive, as determined by @Scale organizers, security staff, or conference management, will not be tolerated. Participants who are asked to stop a behavior or an action are expected to comply immediately or will be asked to leave.

Don't be afraid to call out bad behavior. If you're the target of harmful or offensive behavior, or if you witness someone else being harassed, threatened, or intimidated, don't look away. Tell an @Scale staff member, a security staff member, or a conference organizer immediately about any harmful or offensive behavior that you've experienced or witnessed in any form, whether in person or online.

We at @Scale want our events to be safe for everyone, and we have a zero-tolerance policy for violations of our code of conduct. @Scale conference organizers will investigate any allegation of problematic behavior, and we will respond accordingly. We reserve the right to take any follow-up actions we determine are needed. These include issuing a warning, refusing admittance, ejecting a participant from the conference with no refund, and banning a participant from future @Scale events.

11:00am - 12:00pm

Wednesday, August 19 — Asynchronous computing @Facebook: Driving efficiency and developer productivity at Facebook scale

A lot of work happens behind the scenes that requires time and processing that should not block real-time actions in our products. Things like notifications, event invites, and video rendering may entail long waits or require a large amount of processing power. In this talk, we will detail how Facebook handles asynchronous processing at large scale and the challenges that come with maintaining the reliability of a large multitenant system that's constantly growing in demand. Learn more about Async here:
11:00am - 12:00pm

Wednesday, August 26 — Scaling services with Shard Manager

Driven by the growth of Facebook users and a richer product experience, sharded back-end services have proliferated in the past decade. Since 2011, we have been pioneering the idea of abstracting out "sharding as a platform" to help stateful services scale easily. In this talk, we present Shard Manager, a generic shard management platform, and share how it facilitates the development and operation of the high hundreds of diverse stateful sharded services, totaling tens of millions of shards hosted on hundreds of thousands of servers. We will look at how Shard Manager is fully integrated into our infrastructure ecosystem and provides a holistic, end-to-end solution supporting not only basic shard failover but also sophisticated load balancing, shard scaling, and operational safety.
11:00am - 12:00pm

Wednesday, September 2 — Containerizing ZooKeeper: Powering container orchestration from within

At Facebook, virtually all our infrastructure is powered in some fashion by Apache ZooKeeper. This includes service discovery, configuration management, package deployment, and cluster management: every piece of our infrastructure must maintain commitments of consistency and durability in the face of machine failures, network partitions, and human error. More often than not, ZooKeeper is the low-dependency metadata storage service of choice. The infrastructure that ZooKeeper powers comprises a ubiquitous cluster management platform, atop which the rest of Facebook’s software runs. This platform autonomously manages thousands of services across millions of machines, providing a huge degree of flexibility and leverage for engineers. So when Facebook’s ZooKeeper team decided that they, too, wanted this flexibility and leverage, it meant turning our dependency graph on its head. In this talk, we will present the 18-month journey that brought hundreds of ZooKeeper ensembles in from the cold bare metal so they could safely run atop the cluster management platform that they make possible.
11:00am - 12:00pm

Wednesday, September 9 — Fault tolerance through optimal workload placement

Electrical faults, issues during routine maintenance, and even incidents such as a snake crawling into our power infrastructure are common occurrences in our data centers. These events cause machine outages within a single failure domain of a data center, but they can often escalate to a full data-center-level failure for a service. A large contributor to this escalation is that different types of hardware and services are concentrated in specific failure domains within a data center region instead of being well spread across all failure domains, so losing one failure domain means losing a large portion of a given service. As our data center fleet continues to grow quickly over the next several years, we expect the number of these failures to grow as well. In this talk, we will focus on how we have started to optimize the failure domain spread of our hardware, services, and data so that the loss of any failure domain within a data center region affects the smallest possible portion of any service or hardware type. This allows us to build subregion fault tolerance expectations across our systems, including buying enough buffers to lose a failure domain without losing the entire data center and without negative impact on our end users.
11:00am - 12:00pm

Wednesday, September 16 — Throughput Autoscaling: Dynamic sizing for

Facebook's web tier, consisting of thousands of servers, is one of the largest services in existence. The tier needs substantial compute at times of peak usage but also follows a pronounced diurnal load pattern based on active users. Dynamic sizing of this tier allows a large amount of capacity to be freed for other purposes during off-peak times. In this talk, we will present Throughput Autoscaling, the horizontal autoscaling strategy used by Facebook's web tier to free off-peak capacity while maintaining strong safety guarantees. We will compare Throughput Autoscaling with more common approaches to the horizontal autoscaling problem, explore some of their deficiencies, and show how Throughput Autoscaling addresses them to keep Facebook’s web tier both safe and efficient throughout the day.
