Systems @Scale Summer 2023

JULY 18, 2023

JULY 19, 2023

Systems @Scale Summer 2023 is a technical conference intended for engineers who build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduce complex, unprecedented engineering challenges. The @Scale community focuses on bringing forward people's experiences in creating innovative solutions.

ABOUT EVENT

Systems @Scale Summer 2023 will be hosted virtually. Joining us are speakers from Alibaba, Anyscale, Google, Meta, Microsoft, and Uber. The event will showcase a keynote address by Ion Stoica, Professor at UC Berkeley and Executive Chairman of Databricks and Anyscale. It will also feature talks themed around four main topic areas: AI Platform, Hyperscale Infrastructure, Performance Management, and Developer Experience.

Systems @Scale Summer 2023 is organized by: Hui Lei, Marius Eriksen, Jason Flinn, Vipul Patel, Wendy Tam and Hong Tang.

EVENT AGENDA

Event times below are displayed in PT.

Day 1

Tuesday, July 18th

Day 2

Wednesday, July 19th

09:00 AM - 09:05 AM

Host Welcome

Speaker Hui Lei, Meta

Session 1: Keynote

09:05 AM - 09:25 AM

Ray, a unified distributed framework for the modern AI stack

The recent revolution of LLMs and Generative AI is triggering a sea change in virtually every industry. Building new AI applications or incorporating AI into existing applications requires developers to stitch together and scale a plethora of workloads, from data ingestion and pre-processing to training, tuning/fine-tuning, and serving. This is a very challenging task because different workloads require different systems, each with its own APIs, semantics, and constraints. Ray can dramatically simplify building these applications by providing a unified framework that can support and scale all these workloads. As a result, Ray is increasingly used by companies across industries to build scalable ML infrastructures, platforms, and applications. Examples include Uber, Spotify, Instacart, Netflix, Cruise, Ant Group, ByteDance, and OpenAI (to train ChatGPT and other large models). In this talk, I will present the design considerations behind Ray, our experience with using Ray, and the lessons we learned in the process.

Speaker Ion Stoica, Anyscale

09:25 AM - 09:35 AM

Q&A

Speaker Ion Stoica, Anyscale

Moderator Hui Lei, Meta

Session 2: AI Platform

09:35 AM - 09:55 AM

System for Scaling Large Language Models

Serving multi-terabyte models at planet scale is not an easy challenge. It takes everything engineers have in their arsenal to achieve that scale. The problem is further complicated by scientists' ambitions for larger and more complex models. Security requirements add yet another dimension to an already challenging space. This talk is a collection of some of the lessons we have learned during the OpenAI model launch journey.

Speaker Manisha Jain, Microsoft

09:55 AM - 10:15 AM

Michelangelo ML Platform at Uber: Past, Present and Future

Today, machine learning plays a key role in Uber’s business, where it is used to make business-critical decisions across the board, from marketplace pricing, Eats search and discovery, and maps ETA to fraud detection. Michelangelo is an end-to-end ML platform that democratizes machine learning and enables ML practitioners to seamlessly build, deploy, and operate machine learning solutions at Uber’s scale.

In this talk, we will discuss how Michelangelo addresses the challenges of streamlining the ML developer experience with seamless UI- and code-driven model iteration, supporting large-scale deep learning with a declarative ML application framework, and consolidating fragmented ecosystems with a unified API framework inspired by the Kubernetes CRD design pattern.

While Michelangelo has had some industry-leading features such as Horovod and the Palette feature store, the ML industry is rapidly evolving. We believe that the future of Michelangelo is an open ML platform that leverages best-in-class third-party or in-house ML components. We will share some early experience with plugging ML components into Michelangelo by using our unified API framework.

Speaker Min Cai, Uber

10:15 AM - 10:30 AM

IPnext: Next Generation Inference Platform

We built IPnext as a modular control plane comprising multiple loosely coupled model management domains such as deployment, update, and synchronization. We believe this design promotes simplicity, flexibility, and reliability, all of which are necessary to meet the demanding and dynamic requirements of ML inference systems.

Speaker Hitesh Khandelwal, Meta

Speaker Chao Xie, Meta

Featured Blog

IPnext: Meta’s Next Generation Inference Platform

10:30 AM - 10:45 AM

Q&A

Speaker Manisha Jain, Microsoft

Speaker Min Cai, Uber

Speaker Hitesh Khandelwal, Meta

Speaker Chao Xie, Meta

Moderator Ayichew Hailu, Meta

10:45 AM - 10:55 AM

Break

Session 3: Hyperscale Infrastructure

10:55 AM - 11:15 AM

Global Capacity Management with Flux

Flux is a system that uses global routing patterns to accurately size backend services as frontend traffic demand shifts.
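
As a rough illustration of the idea, the sketch below derives per-region backend demand from frontend traffic and a routing matrix, then converts it into server counts. The function name `size_backends`, the routing fractions, the per-server capacity, and the headroom factor are all hypothetical; none of them are taken from Flux itself.

```python
import math

def size_backends(frontend_rps, routing, rps_per_server, headroom=1.2):
    """Return the number of servers needed per backend region.

    frontend_rps: {frontend_region: requests per second}
    routing: {frontend_region: {backend_region: fraction of traffic}}
    All inputs are illustrative assumptions, not Flux's real data model.
    """
    demand = {}
    for fe_region, rps in frontend_rps.items():
        for be_region, fraction in routing[fe_region].items():
            demand[be_region] = demand.get(be_region, 0.0) + rps * fraction
    # Add headroom, then round up to whole servers.
    return {
        region: math.ceil(load * headroom / rps_per_server)
        for region, load in demand.items()
    }

routing = {"us": {"us-east": 0.7, "us-west": 0.3}, "eu": {"eu-north": 1.0}}
before = size_backends({"us": 10_000, "eu": 5_100}, routing, rps_per_server=500)
# Frontend demand shifts from "us" to "eu"; backend sizing follows the shift.
after = size_backends({"us": 6_000, "eu": 9_000}, routing, rps_per_server=500)
print(before)
print(after)
```

In this toy setup, the same routing matrix is reused while only the frontend demand changes, which is what lets the backend sizes track the shift.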

Speaker Hayley Russell, Meta

Speaker Richard Cornew, Meta

11:15 AM - 11:35 AM

The Evolution of High Performance Networks: A Storage Perspective and Practices at Alibaba Cloud

This talk presents the practice and evolution of high-performance networking in Alibaba Cloud Storage services. High-performance networking plays a crucial role in storage systems: it provides the communication channels between users and storage services and also enables high-speed interconnects between storage components. It is the cornerstone of highly available, performant cloud storage services.

Speaker Shu Ma, Alibaba

11:35 AM - 11:55 AM

ServiceRouter: Hyperscale Service Mesh at Meta

Datacenter applications are often structured as many interconnected microservices, and service mesh has become a popular approach to route RPC traffic among services. We present ServiceRouter (SR), perhaps one of the world’s largest service meshes, which has been in production since 2012. SR differs from publicly known service meshes in several important ways. First, SR is designed for hyperscale and currently uses O(10^6) L7 routers to route O(10^10) requests per second across O(10^4) services. Second, in contrast to the common approach of using sidecar or remote proxies, SR employs an embedded routing library, which reduces the hardware cost of our hyperscale service mesh by O(10^5) machines. Third, SR provides built-in support for sharded services, which account for 92% of the total RPC requests in our fleet, whereas existing general-purpose service meshes do not support sharded services. Finally, SR introduces the concept of locality rings to simultaneously minimize RPC latency and balance load across geo-distributed regions, which to our knowledge has not been attempted before.
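
The abstract does not spell out how locality rings work, so the following is only a minimal sketch of one plausible reading: a client considers servers in rings of increasing latency and expands outward only while the closer rings are overloaded. The ring bounds, the load threshold, and all names (`Server`, `pick_server`) are assumptions made for illustration, not ServiceRouter's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_ms: float  # round-trip latency from the client (assumed known)
    load: float        # utilization in [0, 1] (assumed reported by servers)

def pick_server(servers, ring_bounds_ms=(1, 10, 100), max_load=0.8):
    """Pick the least-loaded server in the closest ring with spare capacity."""
    for bound in ring_bounds_ms:
        # Each ring contains every server within the latency bound.
        ring = [s for s in servers if s.latency_ms <= bound]
        candidates = [s for s in ring if s.load < max_load]
        if candidates:
            return min(candidates, key=lambda s: s.load)
    # Every ring is saturated: fall back to the globally least-loaded server.
    return min(servers, key=lambda s: s.load)

servers = [
    Server("same-region-a", 0.5, 0.95),
    Server("same-region-b", 0.7, 0.90),
    Server("nearby-region", 8.0, 0.40),
    Server("far-region", 80.0, 0.10),
]
print(pick_server(servers).name)
```

Here the two closest servers are above the load threshold, so the choice moves out one ring rather than jumping straight to the least-loaded but most distant server. That is the latency-versus-balance trade-off the abstract alludes to.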

Speaker Harshit Saokar, Meta

Speaker Margot Leibold, Meta

Featured Blog

ServiceRouter: Hyperscale Service Mesh at Meta

11:55 AM - 12:10 PM

Q&A

Speaker Hayley Russell, Meta

Speaker Richard Cornew, Meta

Speaker Shu Ma, Alibaba

Speaker Harshit Saokar, Meta

Speaker Margot Leibold, Meta

Moderator Ayichew Hailu, Meta

Day 2

Wednesday, July 19th

Session 4: Performance Management

09:00 AM - 09:20 AM

PolarDB: A Cloud Native Database of Disaggregated Architecture

PolarDB is Alibaba Cloud’s flagship database product. Since its launch five years ago, it has extensively explored new hardware and architectures, such as Intel Optane memory, SmartSSD, and high-speed RDMA networks. Recently, on top of compute-storage disaggregation, we proposed an additional disaggregated memory layer. It acts as an efficient communication hub for our multi-master solution and also provides additional memory for complex queries. This talk will introduce PolarDB’s overall architecture and how it benefits from its multi-level disaggregated architecture.

Speaker David Zhang, Alibaba

09:20 AM - 09:45 AM

AI Observability At Meta Scale

AI training and inference constitute a large portion of Meta’s infrastructure. Executing AI workloads requires fast and expensive compute hardware along with powerful networking systems. This poses new challenges for our observability systems and also opens up opportunities with great potential. In this talk, we present scalable observability infrastructure and tools that enable building faster and more efficient AI software, and we show how we leverage this data for predictive analysis of job efficiency.

Speaker Riham Selim, Meta

Speaker Valentin Andrei, Meta

Speaker Hao Wang, Meta

Speaker Lei Tian, Meta

Featured Blog

System@Scale: AI Observability

09:45 AM - 10:05 AM

Stateful Web Service: A Distributed Lambda Function for Client-Server Interactions

Stateless PHP web services are widely used at Meta, and they provide a great developer experience. However, their stateless nature prevents these services from serving any real-time product experience, since they cannot keep a persistent client-server connection or subscribe to server-side events and the constantly changing data in the social graph. At Meta, we have built a stateful engine with a rich capability set, called BladeRunner, that PHP developers can leverage to build interactive products between client and server where the server-side business logic resides in PHP. In practice, BladeRunner emulates a stateful PHP service so that product developers do not have to deal with the complexities of maintaining long-lived connections and stateful services.

Speaker Pouya Zadkhast, Meta

Speaker Vahid Jazayeri, Meta

Featured Blog

Stateful PHP Web Service: A Distributed Lambda Function for Client-Server Interactions

10:05 AM - 10:25 AM

Q&A

Speaker Pouya Zadkhast, Meta

Speaker Riham Selim, Meta

Speaker Vahid Jazayeri, Meta

Speaker Valentin Andrei, Meta

Speaker Hao Wang, Meta

Speaker Lei Tian, Meta

Speaker David Zhang, Alibaba

Moderator Ariane Jansen, Meta

10:25 AM - 10:35 AM

Break

Session 5: Developer Experience

10:35 AM - 10:55 AM

Service Weaver: A Framework for Writing Distributed Systems

Service Weaver (serviceweaver.dev) is a programming framework that makes it easy to write, deploy, and manage cloud applications in Go. It allows you to postpone some of the hard decisions about how to split your application into microservices until later, while enabling you to write fewer and better microservices. Service Weaver improves application latency by up to 15x and reduces infrastructure costs by up to 9x compared to state-of-the-art approaches. Finally, it allows you to deploy the same application binary locally and across multiple cloud environments.

Speaker Robert Grandl, Google

10:55 AM - 11:15 AM

Prodspec & Annealing: Intent Based Deployment at Google

Google deployments used to be managed by a large number of heterogeneous and complex workflows. Currently, most of production is maintained through continuous intent-based deployment, the result of work since 2015 on Prodspec & Annealing, Google's continuous deployment infrastructure. In this talk, we will briefly describe what we call "continuous intent-based deployment" and then go over the main principles that made Prodspec & Annealing work at scale: why intent generation should be a first-class citizen, how to approach intent actuation with Select-Update-Validate, and how those principles, along with a couple of others, made it possible to encode best practices across many services.
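
The abstract names Select-Update-Validate without defining it. As a hedged sketch of what an intent-actuation loop of that shape might look like, the code below compares an intended state with the actual state, selects one out-of-date target, updates it, and validates the result before moving on. The `reconcile` function, the dictionary-based state, and the health check are hypothetical and are not Prodspec or Annealing APIs.

```python
def reconcile(intent, actual, is_healthy, max_steps=100):
    """Drive `actual` toward `intent` one target at a time.

    intent / actual: {target_name: version}. `is_healthy` is a callable
    (target, version) -> bool standing in for real validation.
    Returns the list of targets that were updated successfully.
    """
    updated = []
    blocked = set()
    for _ in range(max_steps):
        # Select: pick one target whose actual state differs from intent.
        pending = sorted(t for t, v in intent.items()
                         if actual.get(t) != v and t not in blocked)
        if not pending:
            break
        target = pending[0]
        previous = actual.get(target)
        # Update: apply the intended version.
        actual[target] = intent[target]
        # Validate: roll back and stop retrying this target if unhealthy.
        if is_healthy(target, actual[target]):
            updated.append(target)
        else:
            if previous is None:
                del actual[target]
            else:
                actual[target] = previous
            blocked.add(target)
    return updated

intent = {"frontend": "v2", "backend": "v5", "cache": "v1"}
actual = {"frontend": "v1", "backend": "v4", "cache": "v1"}
done = reconcile(intent, actual, lambda t, v: not (t == "backend" and v == "v5"))
print(done, actual)
```

In this example the backend update fails validation and is rolled back, while the frontend update goes through. Updating one target per step keeps a bad change from spreading before it is validated.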

Speaker Pierre Palatin, Google

11:15 AM - 11:35 AM

The Meta Thrift Journey: O(100 Billion QPS) Moved

This talk covers how we keep development costs low amid Meta's growth. To do this, we (1) unified multiple frameworks into a single one and (2) expanded the Thrift schema from passive data structures to encapsulation.

Speaker Do Hyung (Dave) Kwon, Meta

Speaker TJ Yin, Meta

Speaker Nandhini Santhanam, Meta

ADDITIONAL RESOURCES

The Meta Thrift Journey

11:35 AM - 11:50 AM

Q&A

Speaker Robert Grandl, Google

Speaker Pierre Palatin, Google

Speaker Do Hyung (Dave) Kwon, Meta

Speaker TJ Yin, Meta

Speaker Nandhini Santhanam, Meta

Moderator Ariane Jansen, Meta

11:50 AM - 11:55 AM

Closing Remarks

Speaker Hui Lei, Meta

SPEAKERS AND MODERATORS

Hui Lei, Meta

Hui Lei is a Director of Engineering at Meta and a Fellow of the...

Ion Stoica, Anyscale

Ion Stoica is a Professor in the EECS Department at the University of California...

Manisha Jain, Microsoft

Scale has been the central thread running through Manisha's career at Google (2005-2020), Meta...

Min Cai, Uber

Min Cai is a Distinguished Engineer at Uber working on the AI/ML platform (Michelangelo)....

Hitesh Khandelwal, Meta

Hitesh is a Software Engineer in the AI Infra team at Meta. Hitesh works...

Chao Xie, Meta

Meta Engineer leading IPnext

Ayichew Hailu, Meta

Ayichew has been an Engineering Manager at Meta for over four years around broad...

Hayley Russell, Meta

Hayley has been a software engineer at Meta since 2019. She has helped develop...

Richard Cornew, Meta

Richard is an engineer in the Capacity Infrastructure org with a focus on resource...

Shu Ma, Alibaba

Shu Ma is a senior staff engineer at Alibaba Cloud. He has been working...

Harshit Saokar, Meta

Harshit joined Meta in 2017. Over the last six years, he contributed to various...

Margot Leibold, Meta

Margot is a tenured software engineer specialized in developing large-scale distributed systems at Meta....

David Zhang, Alibaba

David Zhang is a staff engineer at Alibaba working on databases and storage systems....

Riham Selim, Meta

Riham is a Performance Engineer at Meta where she works on building tools to...

Valentin Andrei, Meta

Valentin is a software performance engineer who works across teams to optimize AI software...

Hao Wang, Meta

I am a Performance and Capacity Engineer at Meta Resource Foundation team, working on...

Lei Tian, Meta

Lei Tian is a software engineer at Meta. He has been working in the...

Pouya Zadkhast, Meta

Pouya Zadkhast is a Software Engineer at Meta who works on building scalable and...

Vahid Jazayeri, Meta

With an MSc degree in Computing Science from the University of Alberta, I started...

Ariane Jansen, Meta

Ariane has worked at Meta for over 8 years, on performance and reliability tooling,...

Robert Grandl, Google

Robert Grandl is a software engineer at Google, where he is working on Service...

Pierre Palatin, Google

Pierre Palatin is a Staff Software Engineer in Site Reliability Engineering (SRE) at Google....

Do Hyung (Dave) Kwon, Meta

Software Engineer in Thrift C++ and Serialization Team

TJ Yin, Meta

TJ is from China. TJ has been working for Meta for 9 years after...

Nandhini Santhanam, Meta

Nandhini is an Engineering Manager, where she supports the Thrift team. Prior to joining...
