TOPIC: Data, Systems and Networking

Systems @Scale Winter 2023

DECEMBER 13, 2023

Designed for engineers who manage large-scale information systems serving millions of people. Operating such systems often introduces complex, unprecedented engineering challenges.

Register today and check back for upcoming speaker and agenda announcements!


ABOUT EVENT

Systems @Scale Winter 2023 will be hosted virtually. Joining us are speakers from Figma, Google, Meta, Microsoft and Uber. The event will showcase talks on three main topic areas: scaling AI platforms, performance and reliability at scale, and building systems at planetary scale.

EVENT AGENDA

Event times below are displayed in PT.

December 13

09:00 AM - 09:05 AM
Opening Remarks
Host Biren Damani, Meta
Session 1: Scaling AI Platforms
09:05 AM - 09:25 AM
Multi-Tenancy for AI Inference @ Meta Scale

With an increasingly diverse landscape of AI workloads, building efficient and reliable infrastructure is challenging, especially given the emergence of powerful and expensive AI accelerators. We identify multi-tenancy as a key strategy to this end. By understanding the characteristics of AI workloads and their supporting hardware, we have an opportunity to optimize workload colocation and achieve significant infrastructure cost savings.

Speaker Bikash Sharma, Meta
Speaker Leon Yang, Meta
Speaker Ha Pham, Meta
Speaker Zhen Qin, Meta
Featured Blog
MULTI-TENANCY FOR AI INFERENCE AT META SCALE
09:25 AM - 09:45 AM
Automated GPU Sharing at Scale

General-purpose GPUs, with their powerful numerical computing capacity, are popular platforms for accelerating machine-learning workloads. However, GPU workloads often fail to keep the GPU pipeline fully occupied, resulting in low overall resource utilization. To address this inefficiency, we have designed and implemented GPU sharing to improve overall throughput and utilization at the cluster level.

Speaker Xiao Zhang, Google
09:45 AM - 10:05 AM
Beyond the Buzz: Evolution of AIOps to Improve Reliability at Scale

Operating globally distributed services at Meta scale is no easy feat. In a world of increasingly complex systems and ever-growing telemetry data, engineers are left looking for a needle in a haystack during high-pressure critical incidents. How can we automate and assist engineers to accelerate root cause analysis and incident mitigation? In this talk we will demystify the industry buzz around AIOps. You will learn about our multi-year journey of embracing AIOps at Meta and leave with a blueprint for improving the reliability of your systems!

Speaker Nitin Gupta, Meta
Speaker Madhura Parikh, Meta
Featured Blog
THE EVOLUTION OF AIOPS AT META: BEYOND THE BUZZ
10:05 AM - 10:20 AM
Q&A
Moderator Shilpa Lawande, Meta
Speaker Madhura Parikh, Meta
Speaker Bikash Sharma, Meta
Speaker Leon Yang, Meta
Speaker Ha Pham, Meta
Speaker Zhen Qin, Meta
Speaker Xiao Zhang, Google
Speaker Nitin Gupta, Meta
10:20 AM - 10:40 AM
Break
Session 2: Building Systems at Planetary Scale
10:40 AM - 11:00 AM
Conveyor: One-Tool-Fits-All Continuous Software Deployment at Meta

This presentation provides an overview of the design, implementation, usage scenarios and statistics of Conveyor, Meta's Continuous Deployment system. Conveyor has been in production since 2015 and performs over 100,000 deployments per week across more than 10,000 services and millions of machines. We provide an analysis of real-world use cases and highlight advanced features necessary for a deployment tool to effectively support all services in the fleet.

Speaker Eddy Li, Meta
Speaker Brian Fitzpatrick, Meta
Featured Blog
FORWARD AND BACK AND FORWARD AND BACK: BUILDING CONVEYOR
11:00 AM - 11:20 AM
LiveGraph - Scaling Real-Time Data Access

LiveGraph is a GraphQL-like system that automatically keeps data up-to-date in the UI as changes happen to the underlying data. When it was first built, Figma had 500,000 weekly active users and only a single Postgres instance. Since then, we've scaled LiveGraph to keep up with the growth of Figma. Most recently, we’ve built a subscribe-able distributed query cache; this is the story of our journey there.

Speaker Braden Walker, Figma
11:20 AM - 11:40 AM
Bringing Data to Asynchronous Computing at Scale

We delve into the details of bridging Asynchronous Computing with Meta’s data streams and the data warehouse, enabling a multitude of use cases at Meta. We share insights into the technical challenges we faced and the strategies we employed to scale our data transport layer to handle trillions of daily jobs across thousands of heterogeneous tenants. The associated learnings are relevant to high-throughput multi-tenant systems in general.

Speaker Sayak Kundu, Meta
Speaker Omar Abdou, Meta
Featured Blog
BRINGING DATA TO ASYNCHRONOUS COMPUTING AT SCALE
11:40 AM - 12:00 PM
Data Lineage Unleashed: Nitro’s Approach to Automatic Restatements

In an era where data is the lifeblood of enterprise innovation and decision-making, understanding and managing complex data lineage has emerged as a critical challenge for data engineers. This talk delves into the intricacies of data flows within modern businesses, where each layer's changes ripple with profound implications for downstream processes. Addressing the pressing need for automatic restatements, we explore the sophisticated mechanisms behind Microsoft's Nitro — a system designed to support data pipeline operations, with robust automatic restatement capabilities. Join us to unravel the complexities of data lineage and discover how cutting-edge systems like Nitro are revolutionizing the way data changes are propagated, ensuring data integrity and agility in fast-paced business environments.

Speaker Jack Pullikottil, Microsoft
12:00 PM - 12:20 PM
Q&A
Moderator Shilpa Lawande, Meta
Speaker Eddy Li, Meta
Speaker Brian Fitzpatrick, Meta
Speaker Braden Walker, Figma
Speaker Sayak Kundu, Meta
Speaker Omar Abdou, Meta
Speaker Jack Pullikottil, Microsoft
12:20 PM - 12:40 PM
Break
Session 3: Performance and Reliability at Scale
12:40 PM - 01:00 PM
Crane: Uber’s Next-Gen Infrastructure Stack

Uber has been on a multi-year journey to reimagine our infrastructure stack for a hybrid, multi-cloud world. The internal code name for this project is Crane. In this talk we’ll examine the original motivation behind Crane, requirements we needed to satisfy, and some key features of our implementation. Finally, we’ll wrap up with some forward-looking views for Uber’s infrastructure.

Speaker Kurtis Nusbaum, Uber
01:00 PM - 01:20 PM
Warehouse Disaster Recovery

The presentation gives an overview of warehouse disaster recovery (DR). First, it provides context on company-wide DR efforts. Then it gives a quick overview of the data warehouse ecosystem and the importance of warehouse DR efforts. Finally, it covers the warehouse DR design and recent improvements for batch workload recovery.

Speaker Abhishek Marwah, Meta
Speaker Rong Shi, Meta
Featured Blog
WAREHOUSE DISASTER RECOVERY: BATCH-WORKLOAD RECOVERY AT SCALE
01:20 PM - 01:40 PM
Mobile Configuration At Meta

MobileConfig is Meta's internal tool for authoring and distributing configuration values across its many mobile apps and platforms, including Facebook, Instagram, and Oculus. Using MobileConfig, a developer can make a configuration change in minutes and have it distributed quickly and reliably to the billions of users who use our apps. Developers use the system to safely roll out mobile features, configure our many apps, and perform experimentation. We also provide a set of tools that give developers insights into the rollout of config values and allow rapid deployment of configuration values in emergencies. In this presentation, we will discuss some core features of MobileConfig, including our consistency model, performance optimizations, and how mobile configuration can push mobile agile software development to the extreme.

Speaker Michael Leighton, Meta
Featured Blog
MOBILE CONFIGURATION AT META: THE KEY TO MOBILE AGILE DEVELOPMENT AT SCALE
01:40 PM - 02:00 PM
Q&A
Moderator Shilpa Lawande, Meta
Speaker Kurtis Nusbaum, Uber
Speaker Abhishek Marwah, Meta
Speaker Rong Shi, Meta
Speaker Michael Leighton, Meta
02:00 PM - 02:05 PM
Closing Remarks
Host Biren Damani, Meta

SPEAKERS AND MODERATORS

Biren Damani is an Engineering Manager within the Core Systems Organization at Meta. In...

Biren Damani

Meta

Bikash is a software performance engineer at Meta focusing on AI inference efficiency. He...

Bikash Sharma

Meta

Leon Yang is a software engineer on the Meta Containers team working on productionizing...

Leon Yang

Meta

Ha is a software engineer at Meta, working on GPU inference for Ads ML...

Ha Pham

Meta

Zhen Qin is a software engineer on the Meta Ads ML Model Serving team...

Zhen Qin

Meta

Xiao Zhang is a senior staff engineer at Google, working on cluster resource management....

Xiao Zhang

Google

Nitin Gupta is an Engineering Manager at Meta. For the past 8 years, he...

Nitin Gupta

Meta

Madhura is a Software Engineer at Meta, where she has been working on Monitoring...

Madhura Parikh

Meta

Shilpa Lawande is an Engineering Director at Meta, based in Boston. Shilpa’s team builds...

Shilpa Lawande

Meta

Eddy is a Software Engineer at Meta, and has worked there for two and...

Eddy Li

Meta

Brian is a software engineer at Meta working on Continuous Deployment solutions with an...

Brian Fitzpatrick

Meta

Braden works at Figma on LiveGraph, a real-time GraphQL-like system. Before that, he worked...

Braden Walker

Figma

Sayak is a software engineer in the Serverless Computing infrastructure team at Meta. Previously,...

Sayak Kundu

Meta

Omar Abdou is a Software Engineer on the Core Systems Team at Meta working...

Omar Abdou

Meta

Jack is a Partner Architect at Microsoft where he's spent over 16 years working...

Jack Pullikottil

Microsoft

Kurtis Nusbaum is a Senior Staff Software Engineer on the Uber Infrastructure team in...

Kurtis Nusbaum

Uber

Abhishek Marwah

Meta

Rong Shi has been a research scientist at Meta since 2018. He is in...

Rong Shi

Meta

Michael Leighton is a Software engineer in the configuration products platform team at Meta....

Michael Leighton

Meta

LATEST NOTES

Systems @Scale
12/13/2023
Multi-Tenancy for AI Inference at Meta Scale 
Introduction: AI inference at Meta At Meta, AI workloads are pervasive across a wide range of products. Some of those...
Systems @Scale
12/13/2023
The Evolution of AIOps at Meta: Beyond the Buzz
Meta runs thousands of services across millions of servers and multiple data centers throughout the world. Operating such distributed systems...
Systems @Scale
12/13/2023
Forward and Back and Forward and Back: Building Conveyor
Additional author: Boris Grubic Think of how you push new releases of anything, whether at work or on some other...
Systems @Scale
12/13/2023
Bringing Data to Asynchronous Computing at Scale
Asynchronous computing and data processing are building blocks in the modern cloud. The Async Tier is Meta’s platform for serverless...
Systems @Scale
12/13/2023
Warehouse Disaster Recovery: Batch-Workload Recovery at Scale
Site-wide outages (e.g., network or power outages) would cause huge costs in terms of lost revenue and damage to Meta’s...
Systems @Scale
12/13/2023
Mobile Configuration at Meta: The Key to Mobile Agile Development at Scale 
Additional authors: Amit Adhikari, Tong Bao, Diedi Hu, Matt Guo, Zhao Wang, and Arjun Bhasin Intro to MobileConfig MobileConfig is...
