Systems @Scale Spring 2021

FEBRUARY 24, 2021 @ 11:00 AM PST - 1:00 PM PST
MARCH 03, 2021 @ 11:00 AM PST - 1:00 PM PST
MARCH 10, 2021 @ 11:00 AM PST - 1:00 PM PST
MARCH 17, 2021 @ 11:00 AM PDT - 2:00 PM PDT
Designed for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.
RSVPS CLOSED

ABOUT EVENT

Systems @Scale is a technical conference for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The 2021 series will be hosted virtually. Joining us are speakers from Facebook, Stanford, VMware, Google, and Temporal, with topics spanning queuing, workflows, shared storage fabric, consistency at scale, and consensus protocols in the context of large-scale distributed systems.

Starting February 24th and running for four weeks, we will share each week's topics via a blog post on Mondays, followed by a recorded session and live Q&A on Wednesdays.

EVENT AGENDA

Event times below are displayed in PT.

February 24

11:00 AM - 11:30 AM
Scaling A Distributed Priority Queue

Facebook Ordered Queue Service (FOQS) is a distributed priority queue that powers hundreds of microservices across the Facebook stack. Use cases of FOQS range from video infra to data infra to AI infra, among many others. FOQS is horizontally scalable and supports rich semantics such as deferred delivery and ordering by a 32-bit priority. FOQS was conceived about three years ago as a job queue. Since then, it has seen exponential growth and now processes close to 700 billion items per day. In this talk, we present why we built FOQS, its architecture, and some of the scaling challenges that we have solved and have yet to solve. In particular, we will look at some of the complexities that come with building a globally distributed queue to handle Facebook's scale.
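
To make the queueing semantics concrete, here is a minimal single-node sketch of a FOQS-style queue, showing ordering by a 32-bit priority plus deferred delivery. The class and method names are illustrative assumptions, not the actual FOQS API, and the real service shards this state across many hosts.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Item:
    priority: int            # lower value = higher priority, must fit in 32 bits
    deliver_after: float     # deferred delivery: item is invisible before this time
    payload: bytes = field(compare=False)

class InMemoryPriorityQueue:
    """Single-node stand-in for a FOQS-style queue; only the semantics
    (priority ordering, deferred delivery) are meant to be representative."""

    def __init__(self):
        self._items = []

    def enqueue(self, payload: bytes, priority: int, delay_secs: float = 0.0):
        assert 0 <= priority < 2**32, "priority must fit in 32 bits"
        heapq.heappush(self._items, Item(priority, time.time() + delay_secs, payload))

    def dequeue(self):
        """Return the highest-priority item whose delivery time has arrived."""
        now = time.time()
        ready = [item for item in self._items if item.deliver_after <= now]
        if not ready:
            return None
        item = min(ready)
        self._items.remove(item)
        heapq.heapify(self._items)
        return item

q = InMemoryPriorityQueue()
q.enqueue(b"encode-video-123", priority=5)
q.enqueue(b"send-digest-email", priority=100, delay_secs=3600)  # deferred an hour
print(q.dequeue())   # the video-encoding item: ready now and higher priority
```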

Speaker Akshay Nanavati, Facebook
Speaker Niharika Devanathan, Facebook
11:30 AM - 12:00 PM
Breaking Magical Barriers

Building a 300 mph production car is hard. Speed is not a linear measurement, even though we tend to think of it that way. Taking a car from 150 mph to 300 mph requires eight times more power, or O(n^3) in our language, which is incredibly challenging. And there is a lot more to it than power. Pushing the maximum throughput of a single replicated RabbitMQ queue from 25k messages per second to 1 million took a lot more than better algorithms. And no, we didn't change the programming language or the VM it runs on. In order to understand this, we first need a clear picture of the building blocks and their limits. Next, we look at the solution, and conclude by identifying upcoming developments in the wider technology sector that will help us build faster and more efficient queueing systems. You can contribute by attending this talk, trying out what team RabbitMQ has built for 2021, and sharing your ideas on how to make it even better. We are better together.

Speaker Gerhard Lazu, VMware
12:00 PM - 01:00 PM
Live Q&A

March 3
11:00 AM - 11:30 AM
Optimizing Video Storage Via Semantic Replication

Facebook has its own BLOB storage infrastructure that is responsible for storing trillions of BLOBs. To the storage infrastructure, these objects are just opaque bytes of varying sizes that must be stored reliably. At the user level, however, each blob might represent a picture or video that a user has uploaded to Facebook. Every uploaded picture or video goes through several processing pipelines, which can produce multiple, logically equivalent representations of the same entity. For example, a video might be transcoded into multiple encodings with different resolutions and codecs. Although these encodings are different blobs, they represent the same video, and any of them can be served to users based on parameters such as network speed and available codecs. In that sense, the blobs are related. These different blobs representing the same logical object are called ‘Semantic Replicas’. Given that any of the Semantic Replicas can be served to the user, we can improve the availability and read reliability of the logical asset by storing different Semantic Replicas in different failure domains. Not all videos are equal: some get watched more than others, and the probability of a video being watched drops significantly over time. This information can be used to store videos more efficiently and reduce the storage cost per byte while maintaining the user experience. Chidambaram and Dan will talk about the end-to-end architecture of the systems being built to achieve optimizations across the Facebook stack that enhance user experience and lower the storage cost per byte.
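
As a rough illustration of the placement idea (not Facebook's actual placement code), the sketch below spreads the different encodings of one video across distinct failure domains, so that losing one domain still leaves at least one servable representation of the logical asset. The function, encoding, and domain names are hypothetical.

```python
from itertools import cycle

def place_semantic_replicas(encodings, failure_domains):
    """Assign each semantic replica (a different encoding of the same video)
    to a failure domain, rotating through domains so equivalent replicas do
    not share a single point of failure."""
    if not failure_domains:
        raise ValueError("need at least one failure domain")
    domains = cycle(failure_domains)
    return {encoding: next(domains) for encoding in encodings}

# Example: four encodings of one video spread over three regions.
print(place_semantic_replicas(
    ["1080p_h264", "720p_h264", "480p_vp9", "360p_av1"],
    ["region-a", "region-b", "region-c"],
))
```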

Speaker Dan Danaila, Facebook
Speaker Chidambaram Muthu, Facebook
11:30 AM - 12:00 PM
Multi-Purpose Cloud Filesystem

Every business needs a robust backup and disaster recovery (DR) solution. DR has always been expensive, complex, and error-prone. We have converted this complex backup and DR process into a simple-to-use SaaS solution. Solving this problem required a new type of multi-purpose cloud filesystem. In this session, we briefly describe the business problem and dive into the design of our cloud filesystem.

Speaker Sazzala Reddy, VMware
12:00 PM - 01:00 PM
Live Q&A

March 10
11:00 AM - 11:30 AM
Workflows@Facebook: Powering Developer Productivity and Automation at Facebook Scale

At Facebook, several large-scale distributed applications, such as distributed video encoding and machine learning, require a series of steps to be performed with dependencies between them. Building and operating large-scale distributed systems is a hard problem in itself: hard to develop for, to reason about, and to debug. One way to deal with this complexity is to raise the level of abstraction and treat these problems, which involve a series of steps to follow, as workflows. A workflow framework keeps the complexity under control and lets developers focus on the business process they care about, while the framework takes care of the heavy lifting around orchestration, debuggability, reliability, and state management. Dan Shiovitz and Girish Joshi will talk about the past, present, and future of workflows at Facebook: how several workflow systems have evolved at Facebook, the taxonomies and tradeoffs of the workflow authoring styles they support, experiences from building a couple of general-purpose workflow systems to cover broad workflow needs and scaling them from zero to billions of workflows a day while staying reliable and observable, and the attempts and challenges of converging these general-purpose systems into a unified workflow system.
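
To make the "series of steps with dependencies" framing concrete, here is a minimal sketch of a workflow declared as a dependency graph and executed in topological order. The step names and dictionary-based API are illustrative assumptions, not Facebook's workflow framework; a real engine would also persist state, retry failed steps, and resume after crashes.

```python
from graphlib import TopologicalSorter

# A workflow is a set of named steps plus a dependency graph between them.
def encode_video(ctx):       ctx["encoded"] = True
def generate_thumbnail(ctx): ctx["thumbnail"] = True
def publish(ctx):            ctx["published"] = ctx["encoded"] and ctx["thumbnail"]

STEPS = {
    "encode_video": (encode_video, []),
    "generate_thumbnail": (generate_thumbnail, ["encode_video"]),
    "publish": (publish, ["encode_video", "generate_thumbnail"]),
}

def run_workflow(steps):
    """Execute steps in dependency order, passing a shared context dict."""
    graph = {name: deps for name, (_, deps) in steps.items()}
    ctx = {}
    for name in TopologicalSorter(graph).static_order():
        steps[name][0](ctx)
    return ctx

print(run_workflow(STEPS))   # {'encoded': True, 'thumbnail': True, 'published': True}
```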

Speaker Dan Shiovitz, Facebook
Speaker Girish Joshi, Facebook
11:30 AM - 12:00 PM
Designing a Workflow Engine From First Principles

Temporal is a complex distributed system, but most of its complexity is not accidental. Maxim’s presentation reveals how the core design decisions resulted from applying first principles to satisfy a few basic requirements.

Speaker Maxim Fateev, Temporal Technologies
12:00 PM - 01:00 PM
Live Q&A

March 17
11:00 AM - 11:30 AM
FlightTracker: Social Graph Consistency at Scale

Facebook delivers fresh personalized content by performing a large number of reads against our online data store, TAO. TAO is read-optimized and handles billions of queries per second. To achieve high efficiency and organizational scalability, and to optimize for different workloads, Facebook has traditionally relied on asynchronous replication, which presents consistency challenges: application developers must handle or build guardrails against potentially stale reads. Historically, TAO alleviated this burden by providing read-your-writes (RYW) consistency by default using write-through caching. This strategy fell short as our social graph ecosystem expanded to include global secondary indexing as well as other backend data stores. We built FlightTracker to manage consistency for the social graph at Facebook. By decomposing the consistency problem, FlightTracker provides uniform semantics across heterogeneous data stores. FlightTracker allowed us to evolve beyond write-through caching and fixed communication patterns, improving reliability for TAO. It maintains the efficiency, latency, and availability advantages of asynchronous replication, preserves the decoupling of our systems, and achieves organizational scalability. Through FlightTracker, we provide user-centric RYW consistency as a baseline while allowing select use cases, such as real-time pub-sub systems like GraphQL Subscriptions, to obtain higher levels of consistency. This talk will introduce how FlightTracker provides RYW consistency at scale and focus on how it works for global secondary indexing systems. For more details, please refer to our blog post or our paper in OSDI ’20.
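
As a toy illustration of read-your-writes tracking (the OSDI '20 paper describes FlightTracker's actual mechanism; this is only an assumed simplification), the sketch below records a user's recent write versions and routes reads so they never return data older than those writes. All class and method names here are hypothetical.

```python
from collections import defaultdict

class Store:
    """Toy versioned key-value store standing in for a replica or primary."""
    def __init__(self):
        self.data = {}                       # key -> (value, version)
    def put(self, key, value, version):
        self.data[key] = (value, version)
    def get(self, key):
        return self.data.get(key, (None, 0))

class WriteTracker:
    """Toy read-your-writes tracker: remember each user's recent write versions
    and check them at read time."""
    def __init__(self):
        self.tickets = defaultdict(dict)     # user -> {key: minimum required version}

    def record_write(self, user, key, version):
        self.tickets[user][key] = max(self.tickets[user].get(key, 0), version)

    def read(self, user, key, replica, primary):
        required = self.tickets[user].get(key, 0)
        value, version = replica.get(key)
        if version >= required:              # replica has caught up: cheap local read
            return value
        value, _ = primary.get(key)          # otherwise fall back to an up-to-date copy
        return value

# A user writes version 2 to the primary; a lagging replica still has version 1.
primary, replica, tracker = Store(), Store(), WriteTracker()
replica.put("post:42", "old caption", version=1)
primary.put("post:42", "new caption", version=2)
tracker.record_write("alice", "post:42", version=2)
print(tracker.read("alice", "post:42", replica, primary))   # "new caption"
```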

Speaker Xiao Shi, Facebook
Speaker Scott Pruett, Facebook
11:30 AM - 12:00 PM
Virtualizing Consensus in Delos for Rapid Upgrades and Happy Engineers

Delos is a control-plane storage platform used within Facebook. Delos employs a novel technique called Virtual Consensus that allows any of its components to be replaced without downtime. Virtual Consensus enables rapid upgrades: engineers can evolve the system over time, making it faster, more efficient, and more reliable. Separately, Delos lowers the overhead of building and maintaining a database by providing a common platform for consensus and local storage that can be reused across different top-level APIs. We currently run two Delos databases in production, DelosTable (a table store) and Zelos (a ZooKeeper clone); in the process, we amortize development cycles and operational know-how across the services while benefiting from cross-cutting performance and reliability improvements.
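
A minimal sketch of the virtualized-consensus idea, under the assumption that the shared log is modeled as a chain of swappable segments: the database appends to and reads from one virtual log, while the concrete log implementation behind it can be replaced at a chosen position without downtime. The names are illustrative, not the actual Delos API.

```python
class VirtualLog:
    """Toy virtual log: a chain of underlying log segments. Reconfiguring seals
    the active segment and directs future appends to a new one, so the consensus
    implementation behind the log can be swapped while readers see one ordered log."""

    def __init__(self, initial_loglet):
        self.segments = [initial_loglet]     # each "loglet" is just a list here

    def append(self, entry):
        self.segments[-1].append(entry)

    def read_all(self):
        return [entry for segment in self.segments for entry in segment]

    def reconfigure(self, new_loglet):
        """Seal the active loglet and append new entries to new_loglet."""
        self.segments.append(new_loglet)

vlog = VirtualLog([])
vlog.append("put a=1")
vlog.reconfigure([])            # e.g. swap in a faster loglet implementation
vlog.append("put b=2")
print(vlog.read_all())          # ['put a=1', 'put b=2']
```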

Speaker Mahesh Balakrishnan, Facebook
12:00 PM - 12:30 PM
Mergeable Summaries for High Scale, High Cardinality Analytics

Querying massive collections of events, sessions, and logs is expensive, and even more so when slicing and dicing by the increasing number of dimensions attached to each piece of data (e.g., browser type, device, feature flag). In this talk, I'll describe how adapting a now-classic technique from distributed consistency, the design of commutative and associative data structures (e.g., CRDTs), can simplify the problem of storing, querying, and centralizing large-scale, high-cardinality datasets. The resulting mergeable summaries provide useful, scalable primitives for answering complex queries, including frequent items, quantiles, and cardinality estimates, over high-cardinality slices. In addition to introducing recent research on this topic, I'll discuss lessons from productionizing mergeable summaries at Microsoft and Sisu and in Apache Druid.
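
A minimal sketch of the mergeable-summary idea, using an exact per-shard frequency counter as a stand-in for a real sketch such as count-min or HyperLogLog: each shard summarizes its own slice of the data, and because the merge is commutative and associative, shards can be combined in any order or grouping before answering a frequent-items query.

```python
from collections import Counter

def summarize(events):
    """Per-shard summary; in practice a bounded sketch plays this role."""
    return Counter(events)

def merge(a, b):
    """Commutative and associative: the order and grouping of merges
    does not change the result."""
    return a + b

shard1 = summarize(["chrome", "safari", "chrome"])
shard2 = summarize(["firefox", "chrome"])
shard3 = summarize(["safari"])

total = merge(merge(shard1, shard2), shard3)
assert total == merge(shard1, merge(shard2, shard3))      # associative
assert merge(shard1, shard2) == merge(shard2, shard1)     # commutative
print(total.most_common(2))    # frequent items across all shards
```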

Speaker Peter Bailis, Sisu
12:30 PM - 01:00 PM
Spanner: Order and Consistency at Scale

Spanner is Google's highly available, replicated SQL database. In this talk we'll explore how Spanner uses globally synchronized time (TrueTime) to provide applications with external consistency and lockless consistent snapshot reads, the performance impact of this approach, and how applications can adjust their consistency expectations for better performance.
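
As a rough illustration of the commit-wait rule described in the Spanner paper (not Google's actual code): a transaction picks a commit timestamp at or above TrueTime's upper bound on "now", then waits until that timestamp is guaranteed to have passed before releasing locks, which is what makes commit order match real-time order. The uncertainty bound and function names below are assumptions for the sketch.

```python
import time

EPSILON = 0.004   # assumed TrueTime uncertainty bound, in seconds

def truetime_now():
    """Toy TrueTime: an interval [earliest, latest] guaranteed to contain
    the true physical time."""
    t = time.time()
    return t - EPSILON, t + EPSILON

def commit(apply_writes):
    _, latest = truetime_now()
    commit_ts = latest                 # pick a timestamp no earlier than true time
    apply_writes(commit_ts)
    # Commit wait: block until commit_ts is definitely in the past, so any
    # transaction that starts afterwards is assigned a strictly larger timestamp.
    while truetime_now()[0] < commit_ts:
        time.sleep(EPSILON / 2)
    return commit_ts

print(commit(lambda ts: None))
```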

Speaker Dan Chia, Google
01:00 PM - 02:00 PM
Live Q&A

SPEAKERS AND MODERATORS

Akshay Nanavati

Facebook

Niharika Devanathan

Facebook

Gerhard Lazu

VMware

Dan Danaila

Facebook

Chidambaram Muthu

Facebook

Sazzala Reddy

VMware

Dan Shiovitz

Facebook

Girish Joshi

Facebook

Maxim Fateev

Temporal Technologies

Xiao Shi

Facebook

Scott Pruett

Facebook

Mahesh Balakrishnan

Facebook

Peter Bailis

Sisu

Dan Chia

Google
