Summer Systems @Scale 2022

Virtual 10:00am - 11:30am


Systems @Scale Summer 2022 is a technical conference for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The Systems @Scale Summer 2022 series will be hosted virtually. Joining us are speakers from Alibaba, Confluent, Google, HashiCorp, Meta, Microsoft, and Rockset. The event spans four weeks, with talks themed around performance, reliability, efficiency, and managing services at scale.

Starting June 8, we will livestream recorded sessions followed by a live Q&A session every Wednesday for four weeks.

Week 1 – June 8: Performance, Reliability, and Efficiency
Week 2 – June 15: Reliability
Week 3 – June 22: Managing Services: Part 1
Week 4 – June 29: Managing Services: Part 2


Our Pledge
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

Our Standards
Examples of behavior that contributes to creating a positive environment include:

  • Using welcoming and inclusive language
  • Being respectful of differing viewpoints and experiences
  • Gracefully accepting constructive criticism
  • Focusing on what is best for the community
  • Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

  • The use of sexualized language or imagery and unwelcome sexual attention or advances
  • Trolling, insulting/derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others’ private information, such as a physical or electronic address, without explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

This Code of Conduct applies within all project spaces, and it also applies when an individual is representing the project or its community in public spaces. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a reasonable belief that an individual's behavior may have a negative impact on the project or its community.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.

This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at

For answers to common questions about this code of conduct, see

10:00am - 10:20am

Week 1 - June 8: Performance, Reliability, and Efficiency - Cache Made Consistent – Cache invalidation might no longer be a hard thing in Computer Science

Cache invalidation is considered one of the hardest things in Computer Science. At Meta, we operate some of the world's largest cache deployments (e.g. Memcache and TAO), serving more than one quadrillion queries a day. We have developed a systematic approach to diagnosing inconsistencies from cache invalidations at scale.
10:20am - 10:40am

Week 1 - June 8: Performance, Reliability, and Efficiency - Leveraging Data in Motion in a Cloud-first World

Apache Kafka has emerged as the de facto standard event streaming platform in enterprise architectures. Many business applications are moving away from data-at-rest to an event-driven architecture so that they can leverage data in real time as new events occur. More than 80% of the Fortune 100 are building their businesses on this new platform. In this talk, I will first share the story behind Kafka: how it was invented, what problem it was trying to solve, and how it has been evolving. Then, I will talk about how making Kafka cloud native creates new opportunities for building a system of record, and share some real-world use cases.
10:40am - 11:00am

Week 1 - June 8: Performance, Reliability, and Efficiency - Introducing Zelos - Zookeeper API leveraging Delos

In this presentation we will introduce Zelos. Zelos provides the exact same semantics as ZooKeeper but is built using Delos. ZooKeeper forms the foundation of Meta's infrastructure stack, and we have been using it for over a decade. Over that decade we have improved the performance of ZooKeeper to meet our scale requirements, but every such improvement has been a nontrivial amount of work, especially given the monolithic design of ZooKeeper. In this talk we will take a deep dive into Zelos's architecture and show how it enables us to overcome some of the scaling limits we were hitting with ZooKeeper. Migrating with zero downtime has been an important goal, and we will give an overview of how we went about doing it.
11:00am - 11:20am

Week 1 - June 8: Performance, Reliability, and Efficiency - Hosting Open Source Relational Databases at Scale on Microsoft Azure

Hosting managed relational database services in the cloud with the availability and reliability guarantees demanded by mission-critical workloads, and doing it at scale, presents a set of interesting challenges. This talk will walk you through the evolution of the open source relational database hosting platform on Azure and how these challenges were addressed while meeting the agility, performance, stability, and cost goals of the business and customers.
11:20am - 11:50am

Week 1 - June 8: Performance, Reliability, and Efficiency - Live Q&A Session

10:00am - 10:20am

Week 2 - June 15: Reliability - How Meta Keeps its Large-scale Infrastructure Hardware Up and Running

Internet services like Facebook, Instagram, and WhatsApp rely on large-scale infrastructure to support the various compute, storage, and AI workloads. With the support of data and ML techniques, we can scale our infrastructure successfully by improving the efficiency of our tooling and workflows. In this presentation we’ll share our recent work on hardware remediation, automated anomaly detection and root cause analysis, error reporting interrupt tuning for minimizing performance overhead, near-real-time at-scale server reboot detection, and an ML framework for predicting repairs for hardware failures. The data and ML solutions help us engage people less but with more context, so we can focus people on the truly challenging work while the repetitive tasks are automated.
10:20am - 10:40am

Week 2 - June 15: Reliability - DADI @ Scale: Deploying Containers at Scale in Alibaba

Alibaba Cloud offers a comprehensive suite of elastic computing services that are based on container technology. Alibaba Group is one of the key customers of Alibaba Cloud, and all of the major applications across its large and diverse set of businesses are run in containers. In this talk, we present DADI, the image system that underpins Alibaba’s containers, and share our experience with deploying it at scale worldwide to serve all of Alibaba Group and a large and rapidly growing number of external customers on the Alibaba Cloud. DADI is a block-level image system that replaces the waterfall model of starting containers (downloading image, unpacking image, starting container) with fine-grained on-demand transfer of remote images, realizing instant start of containers. DADI relies on a peer-to-peer architecture in large clusters to balance network traffic among all the participating hosts. One of the unique features of DADI is that it is based on the standard block device so that the image system is file system and platform agnostic, enabling one image system to handle the many application and container platforms that inevitably span very large organizations including Alibaba. The system is highly extensible, allowing us to quickly add features including trace-based prefetching and custom acceleration of container provisioning for different computing services such as serverless computing or Function-as-a-Service (FaaS). As part of this talk, we highlight the ease with which DADI can support new container technologies including those based on Kata Containers, Firecracker, and gVisor. We conclude with a discussion of ongoing efforts towards more secure containers by leveraging the small attack surface of the DADI block device, and decrypting the container image only within the container.
10:40am - 11:00am

Week 2 - June 15: Reliability - Scaling End to End Reliability Tracking Across Large Scale, Multiplexed Products and Services

This talk introduces a new user experience-focused reliability measurement that exposes end-to-end reliability guarantees across the vertical service stack used by Meta’s family of apps. The talk discusses how the new reliability approach differs from the industry standard, our successes so far, and what comes next.
11:00am - 11:20am

Week 2 - June 15: Reliability - Don't Ship the Org Chart: Rebuilding Istio for User Maintainability

While the cry of "breaking apart the monolith" can be heard throughout the industry, the Istio service mesh took a different tack, and consolidated its control plane microservices into one binary. How did we get here? In this talk, Google DA and Istio Steering Committee member Craig Box will talk about how the team building the service mesh designed it based on the internal Google services it was emulating, why that turned out to be the wrong choice, and how the ship was righted. Google engineer and Istio Environments working group lead Sam Naser will talk about how the new model allows for safer upgrades of the mesh, letting users test new versions with a canary model before rolling out to the whole fleet.
11:20am - 11:50am

Week 2 - June 15: Reliability - Live Q&A Session

10:00am - 10:20am

Week 3 - June 22: Managing Services: Part 1 - Configuration Safety at Scale with Ads

The Configerator repository provides Meta developers with a way to make changes easily and quickly to production services. By default, it pushes changes to all services at Meta in a matter of seconds, and doesn’t have the traditional safeguards that most services have of being able to do extensive testing before releasing a new version of the service. It is one of the main things that enables Meta’s move-fast culture and lets developers iterate quickly and have fast development cycles. However, if not done properly, this can result in many reliability issues. This talk goes over the reliability issues that the Ads organization has had due to these config changes and how we have improved reliability over time by investing in safety measures such as build-time validations and diff-time testing, and by helping to develop and adopt a new service that can gradually push config changes.
10:20am - 10:40am

Week 3 - June 22: Managing Services: Part 1 - Getting from Schemaless Ingest to Fast SQL at Rockset

Rockset provides low-latency SQL access to schemaless data that is ingested in real time. Immediate access to dynamically structured data is very powerful, enabling rapid development and iteration for products built on top, but it needs to be baked into the design and implementation from the start. Rockset’s architecture incorporates ideas from search engines, OLTP databases, and OLAP engines. In this talk we will present some of the database and low-level C++ programming techniques we use to get excellent performance and efficiency over dynamic data.
10:40am - 11:00am

Week 3 - June 22: Managing Services: Part 1 - The Ent Framework: Meta’s Object-Relational Mapping

When you think about Meta’s family of apps, what comes to mind? Maybe the over 6 thousand photos and videos created per second on Instagram, the 5 trillion photos on Facebook, or the 60 million group posts loaded each second. It’s challenging to manage all of the associated databases, schemas, queries, and constraints at a scale like ours. How do we keep this data consistent? How do we handle who can access or read this data in different contexts with different roles and permissions? How do we make it so engineers across teams can easily understand and onboard to another team’s database model? These challenges are why Meta created the Ent Framework, our Object-Relational Mapping layer. The Ent Framework simplifies development for tens of thousands of engineers by automating and simplifying how they integrate with multitudes of different storage systems. Learn about how Meta keeps databases more secure, code less repetitive, and the family of apps more robust via the Ent Framework.
11:00am - 11:30am

Week 3 - June 22: Managing Services: Part 1 - Live Q&A Session

10:00am - 10:20am

Week 4 - June 29: Managing Services: Part 2 - Infra Cloud Service Platform (ICSP)

Building and operating a service is challenging and complex. At scale, service owners need to consider a number of responsibilities including how they develop, deploy, scale, and monitor their service in production. Each of these concerns may require a service owner to understand, configure, and operate multiple underlying supporting systems to accomplish their goal. Service owners desire a solution that's simpler to develop and manage. Infra Cloud Service Platform (ICSP) is a holistic infrastructure product aimed at reducing the complexity of developing and operating services at Meta. As a platform, ICSP provides an integrated ecosystem that streamlines and orchestrates the pieces for the user. The platform is built on three important pillars: 1) a common configuration and control system (the Control Plane), 2) a logical model of operation (Service Data Model), and 3) a code framework (Service Code Experience).
10:20am - 10:40am

Week 4 - June 29: Managing Services: Part 2 - Lessons Learned from Scaling Infrastructure as Code

You adopted an infrastructure as code tool like Terraform. What started as one person writing some configuration and deploying new infrastructure scales to everyone in the company writing their own infrastructure configuration and deploying their own systems. In this talk, we’ll share some of the lessons learned across the Terraform community when scaling infrastructure as code practices from one team to an entire company and its users. We’ll cover the patterns and practices that help address challenges of updating infrastructure, managing infrastructure modules, maintaining security, streamlining cost, and even upgrading and migrating tools.
10:40am - 11:00am

Week 4 - June 29: Managing Services: Part 2 - Global Capacity Management at Meta

Meta currently operates more than 15 data center regions around the world. This rapidly expanding global data center footprint poses new challenges for service owners and for our infrastructure management systems. In this talk, we will present the challenges of managing a global-scale infrastructure and our approach to global service and capacity management. In particular, we’ll focus on the abstractions and guarantees we present to service owners with global capacity, and we’ll walk through our current design and implementation for how we manage our workloads across tens of regions. We’ll also present our future plans with Infra Cloud as we build towards our longer-term vision of transparent, automated global capacity management.
11:00am - 11:30am

Week 4 - June 29: Managing Services: Part 2 - Live Q&A Session

Live Q&A with All Speakers
