Systems @Scale Winter 2023

DECEMBER 13, 2023

Designed for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.

ABOUT EVENT

Systems @Scale Winter 2023 will be hosted virtually. Joining us are speakers from Figma, Google, Meta, Microsoft, and Uber. The event will showcase talks on three main topic areas: scaling AI platforms, performance and reliability at scale, and building systems at planetary scale.

Agenda at a glance

Event times below are displayed in PT.

December 13

09:00 AM - 09:05 AM

Opening Remarks

Session 1: Scaling AI Platforms

09:05 AM - 09:25 AM

Multi-Tenancy for AI Inference @ Meta Scale

09:25 AM - 09:45 AM

Automated GPU Sharing at Scale

09:45 AM - 10:05 AM

Beyond the Buzz: Evolution of AIOps to Improve Reliability at Scale

10:05 AM - 10:20 AM

Q&A

10:20 AM - 10:40 AM

Break

Session 2: Building Systems at Planetary Scale

10:40 AM - 11:00 AM

Conveyor: One-Tool-Fits-All Continuous Software Deployment at Meta

11:00 AM - 11:20 AM

LiveGraph - Scaling Real-Time Data Access

11:20 AM - 11:40 AM

Bringing Data to Asynchronous Computing at Scale

11:40 AM - 12:00 PM

Data Lineage Unleashed: Nitro’s Approach to Automatic Restatements

12:00 PM - 12:20 PM

Q&A

12:20 PM - 12:40 PM

Break

Session 3: Performance and Reliability At Scale

12:40 PM - 01:00 PM

Crane: Uber’s Next-Gen Infrastructure Stack

01:00 PM - 01:20 PM

Warehouse Disaster Recovery

01:20 PM - 01:40 PM

Mobile Configuration At Meta

01:40 PM - 02:00 PM

Q&A

02:00 PM - 02:05 PM

Closing Remarks

SPEAKERS AND MODERATORS

Biren Damani is an Engineering Manager within the Core Systems Organization at Meta. In...

Biren Damani

Meta

Bikash is a software performance engineer at Meta focusing on AI inference efficiency. He...

Bikash Sharma

Meta

Leon Yang is a software engineer on the Meta Containers team working on productionizing...

Leon Yang

Meta

Ha is a software engineer at Meta, working on GPU inference for Ads ML...

Ha Pham

Meta

Zhen Qin is a software engineer on the Meta Ads ML Model Serving team...

Zhen Qin

Meta

Xiao Zhang is a senior staff engineer at Google, working on cluster resource management....

Xiao Zhang

Google

Nitin Gupta is an Engineering Manager at Meta. For the past 8 years, he...

Nitin Gupta

Meta

Madhura is a Software Engineer at Meta, where she has been working on Monitoring...

Madhura Parikh

Meta

Shilpa Lawande is an Engineering Director at Meta, based in Boston. Shilpa’s team builds...

Shilpa Lawande

Meta

Eddy is a Software Engineer at Meta, and has worked there for two and...

Eddy Li

Meta

Brian is a software engineer at Meta working on Continuous Deployment solutions with an...

Brian Fitzpatrick

Meta

Braden works at Figma on LiveGraph, a real-time GraphQL-like system. Before that, he worked...

Braden Walker

Figma

Sayak is a software engineer on the Serverless Computing infrastructure team at Meta. Previously,...

Sayak Kundu

Meta

Omar Abdou is a Software Engineer on the Core Systems Team at Meta working...

Omar Abdou

Meta

Jack is a Partner Architect at Microsoft where he's spent over 16 years working...

Jack Pullikottil

Microsoft

Kurtis Nusbaum is a Senior Staff Software Engineer on the Uber Infrastructure team in...

Kurtis Nusbaum

Uber

Abhishek Marwah

Meta

Rong Shi has been a research scientist at Meta since 2018. He is in...

Rong Shi

Meta

Michael Leighton is a software engineer on the configuration products platform team at Meta....

Michael Leighton

Meta

EVENT AGENDA

Event times below are displayed in PT.

December 13

09:00 AM - 09:05 AM

Opening Remarks

Host Biren Damani, Meta

Session 1: Scaling AI Platforms

09:05 AM - 09:25 AM

Multi-Tenancy for AI Inference @ Meta Scale

With the increasingly diverse landscape of AI workloads, it is challenging to build efficient and reliable infrastructure, especially given the emergence of powerful and expensive AI accelerators. We identify multi-tenancy as a key strategy for addressing this challenge. By understanding the characteristics of AI workloads and their supporting hardware, we can optimize workload colocation to achieve significant infrastructure cost savings.

Speaker Bikash Sharma, Meta

Speaker Leon Yang, Meta

Speaker Ha Pham, Meta

Speaker Zhen Qin, Meta

Featured Blog

MULTI-TENANCY FOR AI INFERENCE AT META SCALE

09:25 AM - 09:45 AM

Automated GPU Sharing at Scale

General-purpose GPUs, with their powerful numerical computing capacity, are popular platforms for accelerating machine-learning workloads. However, GPU workloads often fail to keep the GPU pipeline fully occupied, resulting in low overall resource utilization. To address this inefficiency, we have designed and implemented GPU sharing to improve overall throughput and utilization at the cluster level.

Speaker Xiao Zhang, Google

09:45 AM - 10:05 AM

Beyond the Buzz: Evolution of AIOps to Improve Reliability at Scale

Operating globally distributed services at Meta scale is no easy feat. In a world of increasingly complex systems and ever-growing telemetry data, engineers are left looking for a needle in a haystack during high-pressure critical incidents. How can we automate and assist engineers to accelerate root cause analysis and incident mitigation? In this talk, we will demystify the industry buzz around AIOps. You will learn about our multi-year journey of embracing AIOps at Meta and leave with a blueprint for improving the reliability of your systems!

Speaker Nitin Gupta, Meta

Speaker Madhura Parikh, Meta

Featured Blog

THE EVOLUTION OF AIOPS AT META: BEYOND THE BUZZ

10:05 AM - 10:20 AM

Q&A

Moderator Shilpa Lawande, Meta

Speaker Madhura Parikh, Meta

Speaker Bikash Sharma, Meta

Speaker Leon Yang, Meta

Speaker Ha Pham, Meta

Speaker Zhen Qin, Meta

Speaker Xiao Zhang, Google

Speaker Nitin Gupta, Meta

10:20 AM - 10:40 AM

Break

Session 2: Building Systems at Planetary Scale

10:40 AM - 11:00 AM

Conveyor: One-Tool-Fits-All Continuous Software Deployment at Meta

This presentation provides an overview of the design, implementation, usage scenarios and statistics of Conveyor, Meta's Continuous Deployment system. Conveyor has been in production since 2015 and performs over 100,000 deployments per week across more than 10,000 services and millions of machines. We provide an analysis of real-world use cases and highlight advanced features necessary for a deployment tool to effectively support all services in the fleet.

Speaker Eddy Li, Meta

Speaker Brian Fitzpatrick, Meta

Featured Blog

FORWARD AND BACK AND FORWARD AND BACK: BUILDING CONVEYOR

11:00 AM - 11:20 AM

LiveGraph - Scaling Real-Time Data Access

LiveGraph is a GraphQL-like system that automatically keeps data up-to-date in the UI as changes happen to the underlying data. When it was first built, Figma had 500,000 weekly active users and only a single Postgres instance. Since then, we've scaled LiveGraph to keep up with the growth of Figma. Most recently, we’ve built a subscribable distributed query cache; this is the story of our journey there.

Speaker Braden Walker, Figma

11:20 AM - 11:40 AM

Bringing Data to Asynchronous Computing at Scale

We delve into the details of bridging Asynchronous Computing with Meta’s data streams and the data warehouse, enabling a multitude of use cases at Meta. We share insights into the technical challenges we faced and the strategies we employed to scale our data transport layer to handle trillions of daily jobs across thousands of heterogeneous tenants. The associated learnings are relevant to high-throughput multi-tenant systems in general.

Speaker Sayak Kundu, Meta

Speaker Omar Abdou, Meta

Featured Blog

BRINGING DATA TO ASYNCHRONOUS COMPUTING AT SCALE

11:40 AM - 12:00 PM

Data Lineage Unleashed: Nitro’s Approach to Automatic Restatements

In an era where data is the lifeblood of enterprise innovation and decision-making, understanding and managing complex data lineage has emerged as a critical challenge for data engineers. This talk delves into the intricacies of data flows within modern businesses, where each layer's changes ripple with profound implications for downstream processes. Addressing the pressing need for automatic restatements, we explore the sophisticated mechanisms behind Microsoft's Nitro — a system designed to support data pipeline operations, with robust automatic restatement capabilities. Join us to unravel the complexities of data lineage and discover how cutting-edge systems like Nitro are revolutionizing the way data changes are propagated, ensuring data integrity and agility in fast-paced business environments.

Speaker Jack Pullikottil, Microsoft

12:00 PM - 12:20 PM

Q&A

Moderator Shilpa Lawande, Meta

Speaker Eddy Li, Meta

Speaker Brian Fitzpatrick, Meta

Speaker Braden Walker, Figma

Speaker Sayak Kundu, Meta

Speaker Omar Abdou, Meta

Speaker Jack Pullikottil, Microsoft

12:20 PM - 12:40 PM

Break

Session 3: Performance and Reliability At Scale

12:40 PM - 01:00 PM

Crane: Uber’s Next-Gen Infrastructure Stack

Uber has been on a multi-year journey to reimagine our infrastructure stack for a hybrid, multi-cloud world. The internal code name for this project is Crane. In this talk, we’ll examine the original motivation behind Crane, the requirements we needed to satisfy, and some key features of our implementation. Finally, we’ll wrap up with some forward-looking views on Uber’s infrastructure.

Speaker Kurtis Nusbaum, Uber

01:00 PM - 01:20 PM

Warehouse Disaster Recovery

The presentation gives an overview of warehouse disaster recovery (DR). First, it provides context on company-wide DR efforts. Then it gives a quick overview of the data warehouse ecosystem and the importance of warehouse DR efforts. Finally, it presents the warehouse DR design and recent improvements for batch-workload recovery.

Speaker Abhishek Marwah, Meta

Speaker Rong Shi, Meta

Featured Blog

WAREHOUSE DISASTER RECOVERY: BATCH-WORKLOAD RECOVERY AT SCALE

01:20 PM - 01:40 PM

Mobile Configuration At Meta

MobileConfig is Meta's internal tool for authoring and distributing configuration values across its many mobile apps and platforms, including Facebook, Instagram, and Oculus. Using MobileConfig, a developer can make a configuration change in minutes and have it distributed quickly and reliably to the billions of users who use our apps. Developers use the system to safely roll out mobile features, configure our many apps, and perform experimentation. We also provide a set of tools that give developers insights into the rollout of config values and allow rapid deployment of configuration values in emergencies. In this presentation, we will discuss some core features of MobileConfig, including our consistency model, performance optimizations, and how mobile configuration can push mobile agile software development to the extreme.

Speaker Michael Leighton, Meta

Featured Blog

MOBILE CONFIGURATION AT META: THE KEY TO MOBILE AGILE DEVELOPMENT AT SCALE

01:40 PM - 02:00 PM

Q&A

Moderator Shilpa Lawande, Meta

Speaker Kurtis Nusbaum, Uber

Speaker Abhishek Marwah, Meta

Speaker Rong Shi, Meta

Speaker Michael Leighton, Meta

02:00 PM - 02:05 PM

Closing Remarks

Host Biren Damani, Meta

LATEST NOTES

Systems & Reliability @Scale

12/13/2023

Multi-Tenancy for AI Inference at Meta Scale

Introduction: AI inference at Meta. At Meta, AI workloads are pervasive across a wide range of products. Some of those...

Systems & Reliability @Scale

12/13/2023

The Evolution of AIOps at Meta: Beyond the Buzz

Meta runs thousands of services across millions of servers and multiple data centers throughout the world. Operating such distributed systems...

Systems & Reliability @Scale

12/13/2023

Forward and Back and Forward and Back: Building Conveyor

Additional author: Boris Grubic. Think of how you push new releases of anything, whether at work or on some other...

Systems & Reliability @Scale

12/13/2023

Bringing Data to Asynchronous Computing at Scale

Asynchronous computing and data processing are building blocks in the modern cloud. The Async Tier is Meta’s platform for serverless...

Systems & Reliability @Scale

12/13/2023

Warehouse Disaster Recovery: Batch-Workload Recovery at Scale

Site-wide outages (e.g., network or power outages) would cause huge costs in terms of lost revenue and damage to Meta’s...

Systems & Reliability @Scale

12/13/2023

Mobile Configuration at Meta: The Key to Mobile Agile Development at Scale

Additional authors: Amit Adhikari, Tong Bao, Diedi Hu, Matt Guo, Zhao Wang, and Arjun Bhasin. Intro to MobileConfig: MobileConfig is...
