Systems @Scale Summer 2023

JULY 18, 2023
JULY 19, 2023

Systems @Scale Summer 2023 is a technical conference intended for engineers who build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduce complex, unprecedented engineering challenges. The @Scale community focuses on bringing forward people's experiences in creating innovative solutions.

Register today and check back for upcoming speaker and agenda announcements!

RSVPS CLOSED

ABOUT EVENT

Systems @Scale Summer 2023 will be hosted virtually. Joining us are speakers from Alibaba, Anyscale, Google, Meta, Microsoft and Uber. The event will showcase a keynote address by Ion Stoica, Professor at UC Berkeley and Executive Chairman of Databricks and Anyscale. It will also feature talks themed around four main topic areas: AI Platform, Hyperscale Infrastructure, Performance Management and Developer Experience.

Systems @Scale Summer 2023 is organized by: Hui Lei, Marius Eriksen, Jason Flinn, Vipul Patel, Wendy Tam and Hong Tang.

EVENT AGENDA

Event times below are displayed in PT.

Day 1: Tuesday, July 18th (Host Welcome, Sessions 1-3)

Day 2: Wednesday, July 19th (Sessions 4-5, Closing Remarks)

09:00 AM - 09:05 AM
Host Welcome
Speaker Hui Lei,Meta
Session 1: Keynote
09:05 AM - 09:25 AM
Ray, a unified distributed framework for the modern AI stack

The recent revolution of LLMs and Generative AI is triggering a sea change in virtually every industry. Building new AI applications or incorporating AI into existing applications requires developers to stitch together and scale a plethora of workloads, from data ingestion and pre-processing to training, tuning/finetuning, and serving. This is a very challenging task, as different workloads require different systems, each of which comes with its own APIs, semantics, and constraints. Ray can dramatically simplify building these applications by providing a unified framework that can support and scale all these workloads. As a result, Ray is increasingly being used by companies across industries to build scalable ML infrastructures, platforms, and applications. Examples include Uber, Spotify, Instacart, Netflix, Cruise, Ant Group, ByteDance, and OpenAI (to train ChatGPT and other large models). In this talk, I will present the design considerations behind Ray, our experience with using Ray, and the lessons we learned in the process.

Speaker Ion Stoica,AnyScale
09:25 AM - 09:35 AM
Q&A
Speaker Ion Stoica,AnyScale
Moderator Hui Lei,Meta
Session 2: AI Platform
09:35 AM - 09:55 AM
System for Scaling Large Language Models

Serving multi-terabyte models at planet scale is no easy challenge. It takes everything engineers have in their arsenal to achieve that scale. The problem is further complicated by scientists' ambitions for larger and more complex models. Security requirements add yet another dimension to an already challenging space. This talk is a collection of some of the lessons we have learned during the OpenAI model launch journey.

Speaker Manisha Jain,Microsoft
09:55 AM - 10:15 AM
Michelangelo ML Platform at Uber: Past, Present and Future

Today, machine learning plays a key role in Uber’s business, being used to make business-critical decisions across the board, including marketplace pricing, Eats search and discovery, maps ETA, and fraud detection. Michelangelo is an end-to-end ML platform that democratizes machine learning and enables ML practitioners to seamlessly build, deploy, and operate machine learning solutions at Uber’s scale.

In this talk, we will discuss how Michelangelo addresses the challenges of streamlining the ML developer experience with seamless UI and code-driven model iteration, supporting large-scale deep learning with a declarative ML application framework, and consolidating fragmented ecosystems with a unified API framework inspired by the Kubernetes CRD design pattern.

While Michelangelo has had some industry-leading features, such as Horovod and the Palette feature store, the ML industry is rapidly evolving. We believe that the future of Michelangelo is an open ML platform that leverages best-in-class third-party or in-house ML components. We will share some early experience with plugging ML components into Michelangelo using our unified API framework.

Speaker Min Cai,Uber
10:15 AM - 10:30 AM
IPnext: Next Generation Inference Platform

We built IPnext as a modular control plane that comprises multiple loosely coupled model management domains, such as deployment, update, and synchronization. We believe that our design promotes simplicity, flexibility, and reliability, all of which are necessary to meet the demanding and dynamic requirements of ML inference systems.

Speaker Hitesh Khandelwal,Meta
Speaker Chao Xie,Meta
Featured Blog
IPNEXT: META’S NEXT GENERATION INFERENCE PLATFORM
10:30 AM - 10:45 AM
Q&A
Speaker Manisha Jain,Microsoft
Speaker Min Cai,Uber
Speaker Hitesh Khandelwal,Meta
Speaker Chao Xie,Meta
Moderator Ayichew Hailu,Meta
10:45 AM - 10:55 AM
Break
Session 3: Hyperscale Infrastructure
10:55 AM - 11:15 AM
Global Capacity Management with Flux

Flux is a system that uses global routing patterns to accurately size backend services as frontend traffic demand shifts.
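To make the sizing idea concrete, here is a minimal sketch in Python. It is not Flux's actual algorithm, which the abstract does not describe. The function name `backend_servers_needed`, the region names, the routing fractions, and the per-server capacity are all hypothetical. The sketch assumes that backend demand in a region is the sum of the frontend traffic routed to it, and that the server count is that demand divided by per-server capacity, rounded up.

```python
import math


def backend_servers_needed(frontend_rps, routing, rps_per_server):
    """Estimate servers needed per backend region (illustrative only).

    frontend_rps:   {frontend_region: requests per second}
    routing:        {frontend_region: {backend_region: fraction of traffic}}
    rps_per_server: requests per second one backend server can absorb
    """
    demand = {}
    for fe_region, rps in frontend_rps.items():
        for be_region, fraction in routing[fe_region].items():
            # Each frontend region contributes its routed share of traffic.
            demand[be_region] = demand.get(be_region, 0.0) + rps * fraction
    return {
        region: math.ceil(load / rps_per_server)
        for region, load in demand.items()
    }


# Hypothetical routing pattern: us-east sends a quarter of its traffic abroad.
routing = {
    "us-east": {"us-east": 0.75, "eu-west": 0.25},
    "eu-west": {"eu-west": 1.0},
}

before = backend_servers_needed(
    {"us-east": 1000.0, "eu-west": 500.0}, routing, rps_per_server=100.0
)
# Frontend demand in us-east doubles; both backend regions must grow.
after = backend_servers_needed(
    {"us-east": 2000.0, "eu-west": 500.0}, routing, rps_per_server=100.0
)
print(before)  # {'us-east': 8, 'eu-west': 8}
print(after)   # {'us-east': 15, 'eu-west': 10}
```

The point of the example is that a demand shift in one frontend region changes the required capacity in every backend region it routes to, which is why the routing pattern has to be part of the sizing input.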

Speaker Hayley Russell,Meta
Speaker Richard Cornew,Meta
11:15 AM - 11:35 AM
The Evolution of High Performance Networks: A Storage Perspective and Practices at Alibaba Cloud

This talk presents the practice and evolution of high-performance networking in Alibaba Cloud Storage services. High-performance networking plays a crucial role in storage systems, providing not only communication channels between users and storage services but also high-speed interconnects between storage components. It is the cornerstone of highly available, performant cloud storage services.

Speaker Shu Ma,Alibaba
11:35 AM - 11:55 AM
ServiceRouter: Hyperscale Service Mesh at Meta

Datacenter applications are often structured as many interconnected microservices, and service mesh has become a popular approach to route RPC traffic among services. We present ServiceRouter (SR), perhaps one of the world’s largest service meshes, which has been in production since 2012. SR differs from publicly known service meshes in several important ways. First, SR is designed for hyperscale and currently uses O(10^6) L7 routers to route O(10^10) requests per second across O(10^4) services. Second, in contrast to the common approach of using sidecar or remote proxies, SR employs an embedded routing library, which reduces the hardware cost of our hyperscale service mesh by O(10^5) machines. Third, SR provides built-in support for sharded services, which account for 92% of the total RPC requests in our fleet, whereas existing general-purpose service meshes do not support sharded services. Finally, SR introduces the concept of locality rings to simultaneously minimize RPC latency and balance load across geo-distributed regions, which to our knowledge has not been attempted before.
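The abstract does not define locality rings in detail, so the following Python sketch is only one plausible reading: rings are nested sets of regions within increasing latency bounds of the client, and a request stays in the smallest ring that still has spare capacity. The function name `choose_regions`, the latency bounds, and the load threshold are hypothetical and are not taken from ServiceRouter.

```python
def choose_regions(latency_ms, load, ring_limits_ms=(5, 35, 80), max_load=0.75):
    """Pick candidate regions from the smallest ring with spare capacity.

    latency_ms:     {region: round-trip latency from the client, in ms}
    load:           {region: utilization between 0.0 and 1.0}
    ring_limits_ms: assumed latency bound of each ring, smallest first
    max_load:       assumed average utilization above which a ring is "full"
    """
    for limit in ring_limits_ms:
        # A ring contains every region within the latency bound, so rings nest.
        ring = [region for region, ms in latency_ms.items() if ms <= limit]
        if ring:
            average = sum(load[region] for region in ring) / len(ring)
            if average < max_load:
                return sorted(ring)
    # Every ring is saturated: spread across all known regions.
    return sorted(latency_ms)


latency_ms = {"local": 1, "nearby": 20, "far": 70}

# The local region has headroom, so traffic stays local (lowest latency).
calm = choose_regions(latency_ms, {"local": 0.5, "nearby": 0.4, "far": 0.3})

# The local region is overloaded, so the next ring is used to balance load.
busy = choose_regions(latency_ms, {"local": 0.9, "nearby": 0.4, "far": 0.3})

print(calm)  # ['local']
print(busy)  # ['local', 'nearby']
```

This illustrates the trade-off named in the abstract: latency is minimized while the nearest ring can absorb the load, and traffic widens to farther regions only when it cannot.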

Speaker Harshit Saokar,Meta
Speaker Margot Leibold,Meta
Featured Blog
SERVICEROUTER: HYPERSCALE SERVICE MESH AT META
11:55 AM - 12:10 PM
Q&A
Speaker Hayley Russell,Meta
Speaker Richard Cornew,Meta
Speaker Shu Ma,Alibaba
Speaker Harshit Saokar,Meta
Speaker Margot Leibold,Meta
Moderator Ayichew Hailu,Meta
Session 4: Performance Management
09:00 AM - 09:20 AM
PolarDB: A Cloud Native Database of Disaggregated Architecture

PolarDB is Alibaba Cloud’s flagship database product. Since its launch five years ago, it has extensively explored new hardware and new architectures, such as Intel Optane memory, SmartSSD, and RDMA high-speed networking. Recently, on top of compute-storage disaggregation, we proposed an additional disaggregated memory layer. It acts as an efficient communication hub for our multi-master solution and also provides additional memory for complex queries. This talk will introduce PolarDB’s overall architecture and how it benefits from its multi-level disaggregated architecture.

Speaker David Zhang,Alibaba
09:20 AM - 09:45 AM
AI Observability At Meta Scale

AI training and inference constitute a large portion of Meta’s infrastructure. Executing AI workloads requires fast and expensive compute hardware along with powerful networking systems. This poses new challenges for our observability systems and also presents opportunities with great potential. In this talk, we present scalable observability infrastructure and tools that enable building faster and more efficient AI software, and show how we leverage this data for predictive analysis of job efficiency.

Speaker Riham Selim,Meta
Speaker Valentin Andrei,Meta
Speaker Hao Wang,Meta
Speaker Lei Tian,Meta
FEATURED BLOG
System@Scale: AI Observability
09:45 AM - 10:05 AM
Stateful Web Service: A Distributed Lambda Function for Client-Server Interactions

Stateless PHP web services are widely used at Meta, and they provide a great developer experience. However, their stateless nature prevents these services from serving any real-time product experience, since they cannot keep a persistent client-server connection or subscribe to server-side events and the constantly changing data in the social graph. At Meta, we have built a stateful engine with a rich capability set, called BladeRunner, that PHP developers can leverage to build interactive products between client and server where the server-side business logic resides in PHP. In practice, BladeRunner emulates a stateful PHP service so that product developers do not have to deal with the complexities of maintaining long-lived connections and stateful services.

Speaker Pouya Zadkhast,Meta
Speaker Vahid Jazayeri,Meta
Featured Blog
STATEFUL PHP WEB SERVICE: A DISTRIBUTED LAMBDA FUNCTION FOR CLIENT-SERVER INTERACTIONS
10:05 AM - 10:25 AM
Q&A
Speaker Pouya Zadkhast,Meta
Speaker Riham Selim,Meta
Speaker Vahid Jazayeri,Meta
Speaker Valentin Andrei,Meta
Speaker Hao Wang,Meta
Speaker Lei Tian,Meta
Speaker David Zhang,Alibaba
Moderator Ariane Jansen,Meta
10:25 AM - 10:35 AM
Break
Session 5: Developer Experience
10:35 AM - 10:55 AM
Service Weaver: A Framework for Writing Distributed Systems

Service Weaver (serviceweaver.dev) is a programming framework that makes it easy to write, deploy, and manage cloud applications in Go. It allows you to postpone some of the hard decisions about how to split your application into microservices until later, while enabling you to write fewer and better microservices. Service Weaver improves application latency by up to 15x and reduces infrastructure costs by up to 9x compared to state-of-the-art approaches. Finally, it allows you to deploy the same application binary locally and across multiple cloud environments.

Speaker Robert Grandl,Google
10:55 AM - 11:15 AM
Prodspec & Annealing: Intent Based Deployment at Google

Google deployments used to be managed by a large number of heterogeneous and complex workflows. Today, most of production is maintained through continuous intent-based deployment, the result of work since 2015 on Prodspec & Annealing, Google's continuous deployment infrastructure. In this talk, we will briefly describe what we call "continuous intent-based deployment" and then go over the main principles that made Prodspec & Annealing work at scale: why intent generation should be a first-class citizen, how to approach intent actuation with Select-Update-Validate, and how those principles, along with a couple of others, made it possible to encode best practices across many services.
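The abstract names Select-Update-Validate without spelling it out. As a rough illustration only, the Python sketch below shows one way such a loop could look: select targets whose actual state differs from the intent, update a small batch, and validate before selecting more. The function `reconcile`, its parameters, and the cell names are hypothetical and do not reflect the Prodspec or Annealing APIs.

```python
def reconcile(intent, actual, apply_update, validate, batch_size=1, max_rounds=100):
    """Drive `actual` toward `intent` in small validated batches.

    intent / actual: {target: desired or observed version}
    apply_update:    callable(target, version) that changes the real system
    validate:        callable(target) -> bool, health check after an update
    Returns True once actual matches intent, False if validation fails
    or the round limit is reached first.
    """
    for _ in range(max_rounds):
        # Select: targets whose observed state differs from the intent.
        pending = sorted(t for t in intent if actual.get(t) != intent[t])
        if not pending:
            return True
        batch = pending[:batch_size]
        # Update: push only the selected batch toward the intent.
        for target in batch:
            apply_update(target, intent[target])
        # Validate: stop the rollout as soon as a target looks unhealthy.
        if not all(validate(target) for target in batch):
            return False
    return all(actual.get(t) == version for t, version in intent.items())


# Toy in-memory "production" with two hypothetical cells.
intent = {"cell-a": "v2", "cell-b": "v2"}
actual = {"cell-a": "v1", "cell-b": "v1"}


def apply_update(target, version):
    actual[target] = version


def validate(target):
    return actual[target] == intent[target]


converged = reconcile(intent, actual, apply_update, validate)
print(converged, actual)
```

Because the loop recomputes the difference between intent and actual state on every round, it can be rerun continuously: if nothing has drifted, the select step finds no work and the loop exits immediately.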

Speaker Pierre Palatin,Google
11:15 AM - 11:35 AM
The Meta Thrift Journey: O (100 Billion QPS) Moved

This talk covers how we keep development costs low amid Meta's growth. To do this, we (1) unified multiple different frameworks into a single one and (2) expanded the Thrift schema from passive data structures to encapsulation.

Speaker Do Hyung (Dave) Kwon,Meta
Speaker TJ Yin,Meta
Speaker Nandhini Santhanam,Meta
ADDITIONAL RESOURCES
THE META THRIFT JOURNEY
11:35 AM - 11:50 AM
Q&A
Speaker Robert Grandl,Google
Speaker Pierre Palatin,Google
Speaker Do Hyung (Dave) Kwon,Meta
Speaker TJ Yin,Meta
Speaker Nandhini Santhanam,Meta
Moderator Ariane Jansen,Meta
11:50 AM - 11:55 AM
Closing Remarks
Speaker Hui Lei,Meta

SPEAKERS AND MODERATORS

Hui Lei, Meta
Hui Lei is a Director of Engineering at Meta and a Fellow of the...

Ion Stoica, AnyScale
Ion Stoica is a Professor in the EECS Department at the University of California...

Manisha Jain, Microsoft
Scale has been the central thread running through Manisha's career at Google (2005-2020), Meta...

Min Cai, Uber
Min Cai is a Distinguished Engineer at Uber working on the AI/ML platform (Michelangelo)....

Hitesh Khandelwal, Meta
Hitesh is a Software Engineer in the AI Infra team at Meta. Hitesh works...

Chao Xie, Meta
Meta Engineer leading IPnext

Ayichew Hailu, Meta
Ayichew has been an Engineering Manager at Meta for over four years around broad...

Hayley Russell, Meta
Hayley has been a software engineer at Meta since 2019. She has helped develop...

Richard Cornew, Meta
Richard is an engineer in the Capacity Infrastructure org with a focus on resource...

Shu Ma, Alibaba
Shu Ma is a senior staff engineer at Alibaba Cloud. He has been working...

Harshit Saokar, Meta
Harshit joined Meta in 2017. Over the last six years, he contributed to various...

Margot Leibold, Meta
Margot is a tenured software engineer specialized in developing large-scale distributed systems at Meta....

David Zhang, Alibaba
David Zhang is a staff engineer at Alibaba working on databases and storage systems....

Riham Selim, Meta
Riham is a Performance Engineer at Meta where she works on building tools to...

Valentin Andrei, Meta
Valentin is a software performance engineer who works across teams to optimize AI software...

Hao Wang, Meta
I am a Performance and Capacity Engineer at Meta Resource Foundation team, working on...

Lei Tian, Meta
Lei Tian is a software engineer at Meta. He has been working in the...

Pouya Zadkhast, Meta
Pouya Zadkhast is a Software Engineer at Meta, that works on building scalable and...

Vahid Jazayeri, Meta
With an MSc degree in Computing Science from the University of Alberta, I started...

Ariane Jansen, Meta
Ariane has worked at Meta for over 8 years, on performance and reliability tooling,...

Robert Grandl, Google
Robert Grandl is a software engineer at Google, where he is working on Service...

Pierre Palatin, Google
Pierre Palatin is a Staff Software Engineer in Site Reliability Engineering (SRE) at Google....

Do Hyung (Dave) Kwon, Meta
Software Engineer in Thrift C++ and Serialization Team

TJ Yin, Meta
TJ is from China. TJ has been working for Meta for 9 years after...

Nandhini Santhanam, Meta
Nandhini is an Engineering Manager, where she supports the Thrift team. Prior to joining...

LATEST NOTES

Systems @Scale
07/14/2023
System@Scale: AI Observability
Introduction The latest advancements in AI and the promising results delivered by many of our flagship models justify making considerable...
Systems @Scale
07/11/2023
IPnext: Meta’s Next Generation Inference Platform
Background As the significance of AI continues to grow, it is fueling a wide array of products and services encompassing...
Systems @Scale
07/11/2023
ServiceRouter: Hyperscale Service Mesh at Meta
The increasing need for continuous integration and delivery in data center environments has led to the widespread adoption of microservice...
Systems @Scale
07/11/2023
The Meta Thrift Journey
Thrift is a framework consisting of Codegen, Serialization, and RPC (remote procedure call) for service communication. Here’s a diagram that...
