EVENT AGENDA
Event times below are displayed in PT.
This event is designed for engineers who manage large-scale information systems serving millions of people. Operating such systems often introduces complex, unprecedented engineering challenges.
Register today and check back for upcoming speaker and agenda announcements!
Systems @Scale Winter 2023 will be hosted virtually. Joining us are speakers from Figma, Google, Meta, Microsoft and Uber. The event will showcase talks on three main topic areas: scaling AI platforms, performance and reliability at scale, and building systems at planetary scale.
With the increasingly diverse landscape of AI workloads, it's challenging to build efficient and reliable infrastructure, especially given the emergence of powerful and expensive AI accelerators. We identify multi-tenancy as a key strategy for meeting this challenge. By understanding the characteristics of AI workloads and their supporting hardware, we have an opportunity to optimize workload colocation and achieve significant infrastructure cost savings.
General-purpose GPUs, with their powerful numerical computing capacity, are popular platforms for accelerating machine-learning workloads. However, GPU workloads often fail to keep the GPU pipeline fully occupied, resulting in low overall resource utilization. To address this inefficiency, we have designed and implemented GPU sharing to improve overall throughput and utilization at the cluster level.
Operating globally distributed services at Meta scale is no easy feat. In a world of increasingly complex systems and ever-growing telemetry data, engineers are left looking for a needle in a haystack during high-pressure critical incidents. How can we automate and assist engineers to accelerate root cause analysis and incident mitigation? In this talk we will demystify the industry buzz around AIOps. You will learn about our multi-year journey of embracing AIOps at Meta and leave with a blueprint for improving the reliability of your systems!
This presentation provides an overview of the design, implementation, usage scenarios and statistics of Conveyor, Meta's Continuous Deployment system. Conveyor has been in production since 2015 and performs over 100,000 deployments per week across more than 10,000 services and millions of machines. We provide an analysis of real-world use cases and highlight advanced features necessary for a deployment tool to effectively support all services in the fleet.
LiveGraph is a GraphQL-like system that automatically keeps data up-to-date in the UI as changes happen to the underlying data. When it was first built, Figma had 500,000 weekly active users and only a single Postgres instance. Since then, we've scaled LiveGraph to keep up with the growth of Figma. Most recently, we’ve built a subscribe-able distributed query cache. This is the story of our journey there.
We delve into the details of bridging Asynchronous Computing with Meta’s data streams and the data warehouse, enabling a multitude of use cases at Meta. We share insights into technical challenges faced and the strategies employed to scale our data transport layer to handle trillions of daily jobs across thousands of heterogeneous tenants. The associated learnings hold relevance to high-throughput multi-tenant systems in general.
In an era where data is the lifeblood of enterprise innovation and decision-making, understanding and managing complex data lineage has emerged as a critical challenge for data engineers. This talk delves into the intricacies of data flows within modern businesses, where each layer's changes ripple with profound implications for downstream processes. Addressing the pressing need for automatic restatements, we explore the sophisticated mechanisms behind Microsoft's Nitro — a system designed to support data pipeline operations, with robust automatic restatement capabilities. Join us to unravel the complexities of data lineage and discover how cutting-edge systems like Nitro are revolutionizing the way data changes are propagated, ensuring data integrity and agility in fast-paced business environments.
Uber has been on a multi-year journey to reimagine our infrastructure stack for a hybrid, multi-cloud world. The internal code name for this project is Crane. In this talk we’ll examine the original motivation behind Crane, requirements we needed to satisfy, and some key features of our implementation. Finally, we’ll wrap up with some forward-looking views for Uber’s infrastructure.
This presentation gives an overview of warehouse disaster recovery (DR). First, it provides context on company-wide DR efforts. Then it gives a quick overview of the data warehouse ecosystem and the importance of warehouse DR efforts. Finally, it presents the warehouse DR design and recent improvements to batch workload recovery.
MobileConfig is Meta's internal tool for authoring and distributing configuration values across its many mobile apps and platforms, including Facebook, Instagram, and Oculus. Using MobileConfig, a developer can make a configuration change in minutes and have it distributed quickly and reliably to the billions of users who use our apps. Developers use the system to safely roll out mobile features, configure our many apps, and perform experimentation. We also provide a set of tools that give developers insights into the rollout of config values and allow rapid deployment of configuration values in emergencies. In this presentation, we will discuss some core features of MobileConfig, including our consistency model, performance optimizations, and how mobile configuration can push mobile agile software development to the extreme.
Biren Damani is an Engineering Manager within the Core Systems Organization at Meta. In...
Bikash is a software performance engineer at Meta focusing on AI inference efficiency. He...
Leon Yang is a software engineer on the Meta Containers team working on productionizing...
Ha is a software engineer at Meta, working on GPU inference for Ads ML...
Zhen Qin is a software engineer on the Meta Ads ML Model Serving team...
Xiao Zhang is a senior staff engineer at Google, working on cluster resource management....
Nitin Gupta is an Engineering Manager at Meta. For the past 8 years, he...
Madhura is a Software Engineer at Meta, where she has been working on Monitoring...
Shilpa Lawande is an Engineering Director at Meta, based in Boston. Shilpa’s team builds...
Eddy is a Software Engineer at Meta, and has worked there for two and...
Brian is a software engineer at Meta working on Continuous Deployment solutions with an...
Braden works at Figma on LiveGraph, a real-time GraphQL-like system. Before that, he worked...
Sayak is a software engineer in the Serverless Computing infrastructure team at Meta. Previously,...
Omar Abdou is a Software Engineer on the Core Systems Team at Meta working...
Jack is a Partner Architect at Microsoft where he's spent over 16 years working...
Kurtis Nusbaum is a Senior Staff Software Engineer on the Uber Infrastructure team in...
Rong Shi has been a research scientist at Meta since 2018. He is in...
Michael Leighton is a Software engineer in the configuration products platform team at Meta....