Systems @Scale Spring 2022

Virtual 10:00am - 12:30pm

Event Completed

Systems @Scale Spring 2022 is a technical conference for engineers that manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The Systems @Scale Spring 2022 series will be hosted virtually. Joining us are speakers from Microsoft, Google, Open Source Cassandra, and Meta. The event spans three weeks, with talks themed around reliability, resiliency, efficiency and storage at scale.

Starting March 23rd, for three weeks, we will livestream a recorded session followed by live panel discussion on Wednesdays.

March 23 – Storage Platforms
March 30 – Storage Platforms and Applied Privacy
March 31 – Reliability, Resiliency, Efficiency

Read More Read Less

Our Pledge
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards
Examples of behavior that contributes to creating a positive environment include:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism
  • Focusing on what is best for the community
  • Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

  • The use of sexualized language or imagery and unwelcome sexual attention or advances
  • Trolling, insulting/derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others’ private information, such as a physical or electronic address, without explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

Scope
This Code of Conduct applies within all project spaces, and it also applies when an individual is representing the project or its community in public spaces. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a reasonable belief that an individual's behavior may have a negative impact on the project or its community.

Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at opensource-conduct@fb.com. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.

Attribution
This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq

Event Completed
Agenda
10:00am - 10:20am

Wednesday, March 23 - Transparent Memory Offloading @Met

The unrelenting growth of the memory needs of emerging data center applications, along with ever-increasing cost and volatility of DRAM prices, has led to DRAM being a major infrastructure expense. Alternative technologies, such as NVMe SSDs and upcoming NVM devices, offer higher capacity than DRAM at a fraction of the cost and power. One promising approach is to transparently offload colder memory to cheaper memory technologies via kernel or hypervisor techniques. The key challenge, however, is to develop a datacenter-scale solution that is robust in dealing with diverse workloads and large performance variance of different offload devices such as compressed memory, SSD, and NVM. In this talk, we will present TMO, Meta’s transparent memory offloading solution for heterogeneous data center environments. TMO introduces a new Linux kernel mechanism that directly measures in real time the lost work due to resource shortage across CPU, memory, and I/O. Guided by this information and without any prior application knowledge, TMO automatically adjusts how much memory to offload to heterogeneous devices (e.g., compressed memory or SSD) according to the device’s performance characteristics and the application’s sensitivity to memory-access slowdown. TMO holistically identifies offloading opportunities from not only the application containers but also the sidecar containers that provide infrastructure-level functions. TMO has been running in production for more than a year, and has saved between 20-32% of the total memory across millions of servers in our large datacenter fleet. We have successfully upstreamed TMO into the Linux kernel.
10:20am - 10:40am

Wednesday, March 23 - Manifold: Storage Platform Consolidation

Since 2016, we’ve built, deployed, and scaled a new BLOB storage platform at Meta, called Manifold. Manifold builds on existing BLOB storage infrastructure, but provides a richer, higher-level, general purpose API, and thereby enables teams using storage at Meta move faster. Moreover, by creating a stronger abstraction between storage consumers and storage infrastructure, Manifold enables storage infrastructure teams to move faster too, but realizing this vision requires migrating thousands of production workloads totaling exabytes of production data, and rebuilding multiple systems. We’ll describe our approach.
10:40am - 11:00am

Wednesday, March 23 - Cassandra@Scale: A Deep Dive into Apache Cassandra 4.0

At almost two years in the making Apache Cassandra 4.0 is here. With a focus on performance and stability, it is full of interesting features. This talk takes you through a tour of the new features and performance improvements. From Zero Copy Faster Streaming support to Virtual Tables and Audit Logging, learn how Apache Cassandra 4.0 has departed from previous major releases to be focused on the features required by the most demanding users. Attendees will not only gain an understanding of new features in Apache Cassandra, but understand what drove their creation and therefore have a better idea of how to best use these features in their own systems.
11:00am - 11:15am

Wednesday, March 23 - Scaling Data Ingestion for ML Training at Meta

AI models drive several Meta products like News Feed, Ads, IG Reels, language translation to name a few. Our ranking models consume massive datasets to continuously improve user experience on our platform. In this talk, we discuss our experience of building infrastructure to serve massive scale data to the 1000s of AI models driving our products. Further, we present AI training data pipeline workload characteristics and challenges in scaling these systems for industry-scale use cases. Inefficiencies in these pipelines result in expensive wasted GPU/accelerator training resources in our data centers. In this talk, we outline our experience in optimizing these data ingestion pipelines and our plans to continue innovation in this space.
11:15am - 12:00pm

Wednesday, March 23 - Q&A

Q&A with Niket Agarwal, Dan Schatzberg, Johannes Weiner, Jacob Lacouture, Dinesh Joshi, and Aarti Basant
10:00am - 10:20am

Wednesday, March 30 - The Cosmos Big Data Platform at Microsoft: Over a Decade of Progress and a Decade to Look Forward

Cosmos is the exabyte-scale big data platform at Microsoft, and SCOPE is its main analytics engine. SCOPE and Cosmos support ETL pipelines, decision support systems, and machine learning pipelines. Applications range from simple performance counter aggregation, to tracking the deployment of Windows to billions of devices, to Azure billing, Xbox gamer reputation, Bing relevance and index creation, to Ads campaign tuning. SCOPE is a mature, feature rich, and highly extensible data querying language and engine. It allows user operators and types written in C#, Python and through those, any other programming language. By transparently splitting the work, it allows users to author simple extensions that are executed in parallel transparently. Cosmos clusters include bare metal machines and VMs, depending on price/performance targets and a wide range of security requirements. In this talk we will demo the SCOPE developer experience in Visual Studio, and explore the system behind it by following the lifetime of a job in production. We will cover the basic functionality and architecture, and how we leverage a vertically integrated hardware and software stack design. We will cover utilization, reliability, cost control and how Cosmos and SCOPE thrived while handling 12x server growth, 188x logical data growth, and 108x job growth over the past decade. You will have a glimpse of a system that completes ~100k tasks per second, reads and writes hundreds of PBs of data per day, and supports jobs with several million containers running in parallel while costing pennies per TB of data processed.
10:20am - 10:40am

Wednesday, March 30 - ACS: De-Identified Authentication at Scale

Privacy is core to Meta engineering culture, and one of our fundamental principles is data minimization. We strive to collect and create the minimum amount of data required to provide service. One critical space we’ve identified across the industry is to avoid processing user data with identity. Anonymous Credentials Service (ACS) is a service developed by the Applied Privacy Technology Team to provide a solution at Meta scale. ACS enables clients to authenticate in a de-identified manner. By eliminating user-id in authentication, we preserve user security and meet our data minimization goals. We accomplish this by decoupling authentication into a pre-auth phase (aka credential issuance) and a de-identified request phase where the de-identified request is sent with business data. In this talk, we share an overview of ACS design and deployment, learnings from reliably scaling up to heavy workloads, strategies to support multi-tenancy, and a preview of the future work toward integrating de-identified services with Meta products.
10:40am - 11:00am

Wednesday, March 30 - Highly Available and Strongly Consistent Storage Service Using Chain Replication (Dumbo)

In this talk, we present Dumbo - a simple, reliable, highly available, low dependency object storage system that powers Meta’s build and distribution infrastructure. Dumbo is designed to be extremely simple and have minimal dependencies so as to cater to critical package workloads. This is vital to ensure a seamless disaster recovery and bootstrap strategy. Dumbo leverages chain replication as an approach to coordinate clusters of fail-stop storage servers. We talk about how this compares to classical quorum based solutions in terms of simplicity, performance, consistency and other relevant considerations. We walk-through our journey from a research-paper inspired prototype solution to a fully functional large scale storage service in production. We discuss the milestones, learnings and key observations along the way. We talk about Dumbo’s evolution which started with supporting Meta’s build infrastructure to positioning itself as a gateway to all archival storages within Meta, thereby facilitating the disaster recovery story for core Meta services. We finally wrap with recent feature advancements in our service like cross region replication, backup and restore, recovery provider as a service.
11:00am - 11:30am

Wednesday, March 30 - Q&A

Q&A with Ivan Santa Maria Filho, Shiv Kushwah, Haozhi Xiong, Kumar Mrinal, and Binbin Lu
10:00am - 10:30am

Thursday, March 31 - Shrinking the Impact of Production Incidents

Shrinking Production Incidents details an organized approach for reducing the overall impact of production outages. Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem data, focusing on tasks that 1) Reduce time of detection of the incident, 2) Shorten the time to repair, and 3) Expand the time between failures.
10:30am - 11:00am

Thursday, March 31 - Vacuum Testing for Resiliency: Verifying Disaster Recovery in Complex

Engineers at Meta run thousands of services across millions of machines, and those services all have similar needs that can’t be managed by hand: configuration, deployment, monitoring, routing, orchestration, security. To solve the ever-growing complexities of running in production, we build a lot of infrastructure. But infrastructure is just software, and it has those same needs too. So what powers those systems? We like to imagine a neat, tidy stack of yet more, precisely layered infrastructure, where everything is organized into a faultless, acyclic graph of dependencies. Turtles all the way down. But in reality, the lower in the stack you go, the more you find infrastructure is - needs to be - deeply intertwined. After all, infra systems need to run on millions of servers, just like the rest of production. This can present a scary prospect: if something breaks, how do we know we'll even be able to turn things back on again? In this talk, we'll present BellJar, a new technique we've developed to exercise critical infrastructure in environments where nearly all of Meta's supporting services don’t work. By providing precisely tailored broken environments - each of which is specially vacuum sealed away from the infra we take for granted - we can learn exactly what it takes to bring each system back online during widespread outage. And by incorporating this tooling into our delivery pipeline, we can incrementally bake resiliency back into our ever-growing infrastructure layer cake, even as it continues to evolve.
11:00am - 11:30am

Thursday, March 31 - Southpaw: Token-Based Service Load Balancing, Scaling And QoS System

Southpaw is load balancing, scaling and QoS management system for compute-heavy inferencing services. It takes the approach of abstracting services capabilities into tokens and worklanes, where clients are granted tokens that gives them access to service instance worklanes. By adopting this simple abstraction, we can unify how we view work and load in a heterogeneous cluster in terms of tokens supply and demand. We then can push compute efficiency much higher while controlling scaling and end-to-end quality of service. Today, Southpaw has been running all global Microsoft speech workloads for the past 3 years and have contributed to increasing average compute efficiency by 5 times
11:30am - 12:00pm

Thursday, March 31 - Owl

We will describe Owl, a new system for high-fanout distribution of large data objects to hosts in Meta's private cloud. Owl distributes over 700 petabytes of data per day to millions of client processes. It has improved download speeds and cache hit rate by a factor of 2-3 over BitTorrent and other prior systems used for distribution Meta.
12:00pm - 12:30pm

Thursday, March 31 - Q&A

Q&A with Yuri Grinshteyn, Jie Huang, Christopher Bunn, Osama Abuelsorour, Amr Hanafi, and Arushi Aggarwal

Join the @Scale Mailing List and Get the Latest News & Event Info

Code of Conduct

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy