Event times below are displayed in PT.
Systems @Scale Summer 2023 is a technical conference intended for engineers who build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduce complex, unprecedented engineering challenges. The @Scale community focuses on bringing together people who have faced these challenges to share their experiences creating innovative solutions.
Register today and check back for upcoming speaker and agenda announcements!
Systems @Scale Summer 2023 will be hosted virtually. Joining us are speakers from Alibaba, Anyscale, Google, Meta, Microsoft, and Uber. The event will showcase a keynote address by Ion Stoica, Professor at UC Berkeley and Executive Chairman of Databricks and Anyscale. It will also feature talks themed around four main topic areas: AI Platform, Hyperscale Infrastructure, Performance Management, and Developer Experience.
Systems @Scale Summer 2023 is organized by: Hui Lei, Marius Eriksen, Jason Flinn, Vipul Patel, Wendy Tam and Hong Tang.
Tuesday, July 18th
Wednesday, July 19th
The recent revolution in LLMs and generative AI is triggering a sea change in virtually every industry. Building new AI applications, or incorporating AI into existing applications, requires developers to stitch together and scale a plethora of workloads, from data ingestion and pre-processing to training, tuning/fine-tuning, and serving. This is a very challenging task, as different workloads require different systems, each with its own APIs, semantics, and constraints. Ray can dramatically simplify building these applications by providing a unified framework that can support and scale all of these workloads. As a result, Ray is increasingly being used by companies across industries to build scalable ML infrastructure, platforms, and applications. Examples include Uber, Spotify, Instacart, Netflix, Cruise, Ant Group, ByteDance, and OpenAI (to train ChatGPT and other large models). In this talk, I will present the design considerations behind Ray, our experience with using Ray, and the lessons we learned in the process.
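The stitching-together the abstract describes can be pictured as a pipeline of stages handed to a single scheduler. The sketch below is not Ray code; it uses only the Python standard library as a stand-in (in Ray, each stage would be a `@ray.remote` task or actor scheduled across a cluster), and the stage names and data are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pipeline stages; in Ray each would be a @ray.remote task
# scheduled across a cluster instead of a local thread pool.
def ingest():
    return list(range(8))                    # stand-in for data ingestion

def preprocess(records):
    return [r * 2 for r in records]          # stand-in for feature prep

def train(features):
    return sum(features) / len(features)     # stand-in "model": the mean

def serve(model, request):
    return model + request                   # stand-in for inference

# One framework drives every stage, rather than one system per workload.
with ThreadPoolExecutor() as pool:
    records = pool.submit(ingest).result()
    features = pool.submit(preprocess, records).result()
    model = pool.submit(train, features).result()
    print(serve(model, 1))  # → 8.0
```

The point of the sketch is the shape of the program, not the scheduler: every workload is expressed as a task under one API, which is what lets Ray scale the whole pipeline uniformly.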
Serving multi-terabyte models at planet scale is no easy challenge. It takes everything engineers have in their arsenal to achieve that scale, and the problem is further complicated by scientists' ambitions for larger and more complex models. Adding in security requirements brings yet another dimension to an already challenging space. This talk is a collection of some of the lessons we have learned during the OpenAI model launch journey.
Today, machine learning plays a key role in Uber’s business, being used to make business-critical decisions across the board, from marketplace pricing and Eats search and discovery to maps ETA and fraud detection. Michelangelo is an end-to-end ML platform that democratizes machine learning and enables ML practitioners to seamlessly build, deploy, and operate machine learning solutions at Uber’s scale.
In this talk, we will discuss how Michelangelo addresses the challenges of streamlining the ML developer experience with seamless UI- and code-driven model iteration, supporting large-scale deep learning with a declarative ML application framework, and consolidating fragmented ecosystems with a unified API framework inspired by the Kubernetes CRD design pattern.
While Michelangelo has had some industry-leading features, such as Horovod and the Palette feature store, the ML industry is rapidly evolving. We believe that the future of Michelangelo is an open ML platform that leverages best-in-class third-party or in-house ML components. We will share some early experience with the plug-and-play of ML components in Michelangelo using our unified API framework.
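As a rough illustration of what a Kubernetes-CRD-inspired declarative API can look like, here is a hypothetical Python sketch; the `ModelResource`/`ModelSpec` names and the reconcile logic are invented for illustration, not Michelangelo's actual API. The idea is that a typed resource declares desired state, and a control loop reconciles the deployed state toward it:

```python
from dataclasses import dataclass, field

# Hypothetical CRD-style resource: a typed "kind" plus a declarative spec
# that a control loop reconciles, instead of imperative deployment calls.
@dataclass
class ModelSpec:
    framework: str
    epochs: int
    features: list = field(default_factory=list)

@dataclass
class ModelResource:
    kind: str
    name: str
    spec: ModelSpec

def reconcile(resource, deployed):
    """Return the action needed to make deployed state match the spec."""
    if resource.name not in deployed:
        return "create"
    if deployed[resource.name] != resource.spec:
        return "update"
    return "noop"

eta = ModelResource("TrainingJob", "eats-eta", ModelSpec("torch", 10, ["dist"]))
print(reconcile(eta, {}))                      # → create
print(reconcile(eta, {"eats-eta": eta.spec}))  # → noop
```

A unified API of this shape is also what makes components pluggable: any trainer or serving backend that understands the resource schema can be swapped in behind the same spec.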
We built IPnext as a modular control plane that comprises multiple loosely coupled model management domains like deployment, update, and synchronization. We believe that our design promotes simplicity, flexibility, and reliability -- all necessary components to meet the demanding and dynamic requirements of ML inference systems.
Flux is a system that uses global routing patterns to accurately size backend services based on frontend traffic demand shifts.
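A minimal sketch of the sizing idea, assuming hypothetical routing fractions and a per-replica throughput; none of these names or numbers come from Flux itself:

```python
# Hypothetical demand-driven sizing: backend demand is derived from frontend
# traffic via observed routing fractions, then converted to replica counts.
def size_backends(frontend_rps, route_fraction, rps_per_replica):
    backend_rps = {svc: frontend_rps * frac for svc, frac in route_fraction.items()}
    # ceiling division so a fractional replica rounds up
    return {svc: -(-rps // rps_per_replica) for svc, rps in backend_rps.items()}

print(size_backends(10000, {"feed": 0.6, "ads": 0.2}, 500))
# → {'feed': 12.0, 'ads': 4.0}
```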
This talk presents the practice and evolution of high-performance networking in Alibaba Cloud Storage services. High-performance networking plays a crucial role in storage systems, providing not only communication channels between users and storage services but also high-speed interconnects between storage components. It is the cornerstone of highly available, performant cloud storage services.
Datacenter applications are often structured as many interconnected microservices, and service mesh has become a popular approach to routing RPC traffic among services. We present ServiceRouter (SR), perhaps one of the world’s largest service meshes, which has been in production since 2012. SR differs from publicly known service meshes in several important ways. First, SR is designed for hyperscale and currently uses O(10^6) L7 routers to route O(10^10) requests per second across O(10^4) services. Second, in contrast to the common approach of using sidecar or remote proxies, SR employs an embedded routing library, which reduces the hardware cost of our hyperscale service mesh by O(10^5) machines. Third, SR provides built-in support for sharded services, which account for 92% of the total RPC requests in our fleet, whereas existing general-purpose service meshes do not support sharded services. Finally, SR introduces the concept of locality rings to simultaneously minimize RPC latency and balance load across geo-distributed regions, which to our knowledge has not been attempted before.
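The locality-ring idea can be sketched as follows; the ring names, load threshold, and selection policy below are illustrative assumptions, not SR's actual algorithm:

```python
# Hypothetical sketch of locality rings: candidate replicas are grouped into
# rings by network distance, and a request is routed to the closest ring that
# still has spare capacity, trading latency against load balance.
RINGS = ["same-region", "near-region", "global"]

def pick_replica(replicas):
    """replicas: list of (ring, name, load) tuples; load in [0, 1]."""
    for ring in RINGS:
        candidates = [(load, name) for r, name, load in replicas
                      if r == ring and load < 0.8]  # assumed load threshold
        if candidates:
            return min(candidates)[1]  # least-loaded replica in closest ring
    raise RuntimeError("no replica with spare capacity")

replicas = [("same-region", "a", 0.9),   # local but overloaded, skipped
            ("near-region", "b", 0.5),
            ("near-region", "c", 0.3),
            ("global", "d", 0.1)]
print(pick_replica(replicas))  # → c
```

The sketch shows the latency/load trade-off in miniature: the local replica is skipped because it is overloaded, and traffic spills to the next ring out rather than to the farthest, least-loaded machine.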
PolarDB is Alibaba Cloud’s flagship database product. Since its birth five years ago, it has explored new hardware and new architectures extensively, such as Intel Optane memory, SmartSSDs, and high-speed RDMA networks. Recently, on top of compute-storage disaggregation, we proposed an additional disaggregated memory layer. It acts as an efficient communication hub for our multi-master solution and also provides additional memory for complex queries. This talk will introduce PolarDB’s overall architecture and how it benefits from its multi-level disaggregated architecture.
AI training and inference constitute a large section of Meta’s infrastructure. Executing AI workloads requires fast and expensive compute hardware along with powerful networking systems. This poses new challenges to our observability systems but also presents opportunities with great potential. In this talk, we present the scalable observability infrastructure and tools that enable building faster and more efficient AI software, and show how we leverage this data for predictive analysis of job efficiency.
Stateless PHP web services are widely used at Meta, and they provide a great developer experience. However, their stateless nature prevents these services from serving any real-time product experience, since they cannot keep a persistent client-server connection or subscribe to server-side events and the constantly changing data in the social graph. At Meta, we have built a stateful engine with a rich capability set, called BladeRunner, that can be leveraged by PHP developers to build interactive products between client and server where the server-side business logic resides in PHP. In practice, BladeRunner emulates a stateful PHP service so that product developers do not have to deal with the complexities of maintaining long-lived connections and stateful services.
Service Weaver (serviceweaver.dev) is a programming framework that makes it easy to write, deploy, and manage cloud applications in Go. It allows you to postpone some of the hard decisions about how to split your application into microservices until later, while enabling you to write fewer and better microservices. Service Weaver improves application latency by up to 15x and reduces infrastructure costs by up to 9x compared to state-of-the-art approaches. Finally, it allows you to deploy the same application binary locally and across multiple cloud environments.
Google deployments used to be managed by a large number of heterogeneous and complex workflows. Currently, most of production is maintained through continuous intent-based deployment, the result of work since 2015 on Prodspec and Annealing, Google's continuous deployment infrastructure. In this talk, we will briefly describe what we call "continuous intent-based deployment" and then go over the main principles that made Prodspec and Annealing work at scale: why intent generation should be a first-class citizen, how to approach intent actuation with Select-Update-Validate, and how those principles, along with a couple of others, made it possible to encode best practices across many services.
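A hedged sketch of what a Select-Update-Validate actuation step might look like; the function and state shapes below are hypothetical, not Prodspec's actual API:

```python
# Hypothetical Select-Update-Validate step: select the targets whose live
# state differs from the generated intent, update them, then validate each
# change before declaring the intent applied.
def actuate(intent, live, validate):
    selected = [k for k, v in intent.items() if live.get(k) != v]  # Select
    for k in selected:
        live[k] = intent[k]                                        # Update
        if not validate(k, live[k]):                               # Validate
            raise RuntimeError(f"validation failed for {k}")
    return selected

intent = {"web": "v42", "db": "v7"}
live = {"web": "v41", "db": "v7"}
changed = actuate(intent, live, lambda k, v: v.startswith("v"))
print(changed, live)  # → ['web'] {'web': 'v42', 'db': 'v7'}
```

Note that the loop is idempotent: re-running it against the same intent selects nothing, which is what makes continuous (rather than one-shot) deployment safe.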
How we keep development costs low amid Meta's growth. To do this, we: (1) unified multiple different frameworks into a single one, and (2) expanded the Thrift schema from a passive data structure to an encapsulation.
Hui Lei is a Director of Engineering at Meta and a Fellow of the...
Ion Stoica is a Professor in the EECS Department at the University of California...
Scale has been the central thread running through Manisha's career at Google (2005-2020), Meta...
Min Cai is a Distinguished Engineer at Uber working on the AI/ML platform (Michelangelo)....
Hitesh is a Software Engineer in the AI Infra team at Meta. Hitesh works...
Meta Engineer leading IPnext
Ayichew has been an Engineering Manager at Meta for over four years around broad...
Hayley has been a software engineer at Meta since 2019. She has helped develop...
Richard is an engineer in the Capacity Infrastructure org with a focus on resource...
Shu Ma is a senior staff engineer at Alibaba Cloud. He has been working...
Harshit joined Meta in 2017. Over the last six years, he contributed to various...
Margot is a tenured software engineer specialized in developing large-scale distributed systems at Meta....
David Zhang is a staff engineer at Alibaba working on databases and storage systems....
Riham is a Performance Engineer at Meta where she works on building tools to...
Valentin is a software performance engineer who works across teams to optimize AI software...
I am a Performance and Capacity Engineer at Meta Resource Foundation team, working on...
Lei Tian is a software engineer at Meta. He has been working in the...
Pouya Zadkhast is a Software Engineer at Meta, that works on building scalable and...
With an MSc degree in Computing Science from the University of Alberta, I started...
Ariane has worked at Meta for over 8 years, on performance and reliability tooling,...
Robert Grandl is a software engineer at Google, where he is working on Service...
Pierre Palatin is a Staff Software Engineer in Site Reliability Engineering (SRE) at Google....
Software Engineer in Thrift C++ and Serialization Team
TJ is from China. TJ has been working for Meta for 9 years after...
Nandhini is an Engineering Manager, where she supports the Thrift team. Prior to joining...