AI @Scale 2022, Sep 28, Virtual
RTC @Scale 2022

RTC @Scale Live Panel – RTC in the Metaverse with Sriram Srinivasan, Mike Arcuri, Paul Boustead, and Cullen Jennings.

Systems @Scale Winter 2022

Call for Systems@Scale Winter 2022 Presentations

TLDR: We are looking for Systems@Scale presentations! Please submit your proposals by Sept 26. Systems@Scale is a part of the @Scale conference for engineers that manage and deal with large scale information systems serving millions of ...
Reliability @Scale 2022

Lessons Learned from the Halloween Outage

In this talk, VP of Engineering Max Ross will discuss the 73-hour outage that impacted Roblox late last year. He will also share some of the ways that a multi-day outage can turn conventional reliability wisdom on its head.
Reliability @Scale 2022

QUIC Exit: Exposing a New Class of Outage

A crash bug in QUIC handshake code exposed a new class of bugs we termed ‘contagion bugs’. For these bugs, a tiny number of tasks can cause a huge outage and rollbacks don’t work as expected. This talk details what contagion bugs are, ...
Reliability @Scale 2022

Service Incident Deep Dive: Technical Overview & Learnings

This talk will provide a technical overview of a service incident on the Akamai platform in July 2021 which, despite layers of safety technologies, nevertheless impacted some of Akamai’s customers. In addition to exploring the ...
Reliability @Scale 2022

Lessons From Long-Running Investigations

In this talk, we share some lessons from several of our long-running investigations. Some of them are well-known, but are worth repeating, and some of them are things we learned and want to share.
Reliability @Scale 2022

Pipefail Overview and Discussion

On December 16, 2021, an unlikely series of distributed system events slowed down customer requests through the Cloudflare system for a period of approximately 30 minutes. This talk gives an overview of how unexpected outcomes are ...
Reliability @Scale 2022

Improving Reliability @ Meta: By Analyzing Historical Events That Led to SLO Violations

Learn about the culture of tracking Service Level Indicators/Service Level Objectives at Instagram specifically and Meta in general, the tools that we use and how teams’ SLI/SLO workflows can be improved by annotating SLO violations ...
Reliability @Scale 2022

Service Degradation at Scale: Creating Instant Capacity

We will talk about what factors made us realize that service degradation is necessary for our infrastructure and the challenges we faced while implementing service degradation at scale. We will also speak about how we are changing our ...
Reliability @Scale 2022

Reliably Changing Configuration @ Scale

Thousands of services at Meta use Configuration Management, so it is important we change that configuration reliably. Tune in for a story spanning several years, covering how we exponentially grew coverage of a protection mechanism for ...
Reliability @Scale 2022

Meta’s SEV Culture: How Today’s SEVs Create Tomorrow’s Reliability

Would you believe us if we said the more SEVs we have, the more reliable we are? In this talk we’ll discuss the reasons why we love SEVs at Meta, and how our culture around SEVs has allowed us to build reliable services at ...
Summer Systems @Scale 2022

Cache Made Consistent – Cache invalidation might no longer be a hard thing in Computer Science

Cache invalidation is considered one of the hardest things in Computer Science. We, at Meta, operate some of the world’s largest cache deployments (e.g. Memcache and TAO), serving more than one quadrillion queries a day. We have ...
Summer Systems @Scale 2022

Leveraging Data in Motion in a Cloud-first World

Apache Kafka has emerged as the de facto standard event streaming platform in enterprise architectures. Many business applications are moving away from data-at-rest to an event-driven architecture so that they can leverage the ...
Summer Systems @Scale 2022

Introducing Zelos – Zookeeper API leveraging Delos

In this presentation we will introduce Zelos. Zelos provides the exact same semantics as ZooKeeper but is built using Delos. ZooKeeper forms the foundation of Meta’s infrastructure stack and we have been using it for over a decade. ...
Summer Systems @Scale 2022

Hosting Open Source Relational Databases at Scale on Microsoft Azure

Hosting managed relational database services in the cloud with the availability and reliability guarantees demanded by mission-critical workloads, and doing it at scale, presents a set of interesting challenges. This talk will walk ...
Summer Systems @Scale 2022

How Meta Keeps its Large-scale Infrastructure Hardware Up and Running

Internet services like Facebook, Instagram, and WhatsApp rely on large-scale infrastructure to support the various compute, storage, and AI workloads. With the support of data and ML techniques, we can scale our infrastructure ...
Summer Systems @Scale 2022

DADI @Scale: Deploying Containers at Scale in Alibaba

Alibaba Cloud offers a comprehensive suite of elastic computing services that are based on container technology. Alibaba Group is one of the key customers of Alibaba Cloud and all of the major applications across its large and diverse ...
Summer Systems @Scale 2022

Scaling End to End Reliability Tracking Across Large Scale, Multiplexed Products and Services

This talk introduces a new user experience-focused reliability measurement that exposes end-to-end reliability guarantees across the vertical service stack used by Meta’s family of Apps. The talk discusses the difference between the ...
Summer Systems @Scale 2022

Don’t Ship the Org Chart: Rebuilding Istio for User Maintainability

While the cry of “breaking apart the monolith” can be heard throughout the industry, the Istio service mesh took a different tack, and consolidated its control plane microservices into one binary. How did we get here? In ...
Summer Systems @Scale 2022

Configuration Safety at Scale with Ads

The Configerator repository provides Meta developers with a way to make changes easily and quickly to production services. By default, it pushes changes to all services at Meta in a matter of seconds, and doesn’t have the traditional ...
Summer Systems @Scale 2022

Getting from Schemaless Ingest to Fast SQL at Rockset

Rockset provides low-latency SQL access to schemaless data that is ingested in real-time. Immediate access to dynamically structured data is very powerful, enabling rapid development and iteration for products built on top, but it ...
Summer Systems @Scale 2022

The Ent Framework: Meta’s Object-Relational Mapping

When you think about Meta’s family of apps, what comes to mind? Maybe the over 6 thousand photos and videos created per second on Instagram, the 5 trillion photos on Facebook, or the 60 million group posts loaded each second. It’s ...
Summer Systems @Scale 2022

Infra Cloud Service Platform (ICSP)

Building and operating a service is challenging and complex. At scale, service owners need to consider a number of responsibilities including how they develop, deploy, scale and monitor their service in production. Each of these ...
Summer Systems @Scale 2022

Lessons Learned from Scaling Infrastructure as Code

You adopted an infrastructure as code tool like Terraform. What started as one person writing some configuration and deploying new infrastructure scales to everyone in the company writing their own infrastructure configuration and ...
Summer Systems @Scale 2022

Global Capacity Management at Meta

Meta currently operates more than 15 data center regions around the world. This rapidly expanding global datacenter footprint poses new challenges for service owners and for our infrastructure management systems. In this talk, we will ...
Products @Scale Spring 2022

Keynote | Ime Archibong

Ime Archibong, head of New Product Experimentation (NPE) at Meta and a 12-year Meta veteran, will talk about 0-1 innovation at scale. He’ll discuss the value of experimentation as an approach, and demystify how real breakthroughs happen.
Products @Scale Spring 2022

Building a Cross-platform Runtime for AR Experiences | Nikita Lutsenko and Paul Wu

There are many tools that Creators can use to build novel AR experiences. However, not many of these tools can deliver a wide-ranging set of capabilities and creative assets to billions of devices with both quality and speed. In this ...
Products @Scale Spring 2022

Challenges and opportunities for building crowdsourced mapping services for autonomous driving at scale | Ruchi Bhargava

NVIDIA Map aggregates data from millions of NVIDIA DRIVE Hyperion consumer and survey data-collection vehicles for safe, reliable, and up-to-date global high-def map coverage. The platform supports automated driving functionality from ...
Products @Scale Spring 2022

Scaling Messenger Product Development | Joshua Evenson

As a mobile app grows in users, features, and contributing engineers, there are often tradeoffs between the performance of the app and the velocity of feature growth. Messenger’s users have high performance expectations, so ...
Products @Scale Spring 2022

ML Algorithms for Trust and Safety @ YouTube | Emre Sargin

In this talk, I’ll provide an overview of how we use ML algorithms to detect policy-violating content on YouTube across all entity types (videos, comments, livestreams, engagements, etc.) and keep our community safe. ML ...
Products @Scale Spring 2022

Building private products at WhatsApp | Aleksander Bello

An overview of how WhatsApp thinks of privacy in the messaging world. We’ll go through some of our general principles, concrete product use cases, and challenges that come with privacy at scale.
Products @Scale Spring 2022

Scaling ML workflows for real-time moderation challenges at Twitch | Lukas Tencer, Lena Evans, Shiming Ren

Trust & Safety at Twitch is uniquely challenging, as the vast majority of content and chat interactions unfold in real time, across a wide variety of communities with different needs, cultures, and audiences. Mitigating and ...
Products @Scale Spring 2022

Live Panel: The Good, the Bad and the Glory of Building Products @Scale

Participants: Sara Wong (Meta), Xiao Li (Meta), Boulos Harb (Level), Daniel Jacobson (Google)
Products @Scale Spring 2022

Keynote – Building Products at Scale | Vijaye Raji

Building successful products is hard. Building successful products at scale? Ridiculously hard! It takes strong vision, deep dedication, consistent execution, with a healthy sprinkle of unorthodox methods. This talk shares a few ...
Products @Scale Spring 2022

Building visually stunning products at Scale at Instagram | Steph Rhee and Laycee Berkas

We’ll talk about two 0-1 products in the creator space: Subscriptions and Music Releases on IG. We’ll walk through how we built the early stages of these as visually stunning products, as well as the unique set of challenges our teams ...
Products @Scale Spring 2022

The Evolution of Facebook’s Mobile App Architecture | Dustin Shahidehpour

In 2007, Facebook released their first iOS App. It was written in HTML, and it was supported by a single engineer. Since then, the Facebook iOS App has grown into a native ‘platform’ which supports more than 100 products, and hundreds ...
Products @Scale Spring 2022

Mobile Development @ Scale | Chad Landis

At Capital One, building beautiful, rich, and performant mobile applications for iOS and Android is essential to providing a best-in-class experience for our customers and delivering on our mission to change banking for good. However, ...
Products @Scale Spring 2022

Live Panel: Cross Platform Product Development @Scale

Participants: Jason Grandelli (Meta), Dan Schafer (Meta), Denise Noyes (Meta), Kevin Galligan (TouchLab), Vishnu Nath (Microsoft)
Networking @Scale Summer 2022

The Future With QUIC | Jana Iyengar

We’ve all heard much about QUIC in the past few years, and a lot has been made of its performance benefits for HTTP/3. For some of us however, HTTP/3 was always just the beginning, just the vehicle for us to get QUIC out into the ...
Networking @Scale Summer 2022

Quick Cache DSR | Matt Joras and Yair Gottdenker

In a typical CDN architecture the caching tier is fronted by a load-balancing tier; response content flows from the cache to the requester through the load-balancer. With this architecture extra I/O, CPU cycles and intra-cluster ...
Networking @Scale Summer 2022

Improving Transfer Times in the Backbone Network Using QUIC Jump Start | Joseph Beshay

Transfers in high-BDP links incur a startup delay for congestion control to probe the bandwidth of the underlying link. The impact of this delay is inversely proportional to the size of the transfer since small transfers may repeatedly ...
Networking @Scale Summer 2022

LIVE Q&A | Moderated by Bharat Parekh

LIVE Q&A featuring Jana Iyengar, Matt Joras, Yair Gottdenker & Joseph Beshay
Networking @Scale Summer 2022

Layer Four and Three Quarters: Fantastic Quirks and Where to Find Them | Lucas Pardue

Nestled between transport protocols (TCP, UDP, QUIC) and application protocols (HTTP, etc.) is a layer few are familiar with. Layer 4¾ sits hiding in plain sight, often only being glimpsed during curious events that raise its ...
Networking @Scale Summer 2022

The Challenges of 0-RTT in IETF QUIC | Ian Swett

A key feature of HTTP/3 over QUIC is the ability to send a request in the first flight with the ClientHello. 0-RTT in IETF QUIC is notably more complex than gQUIC, with multiple packet number spaces and a limit on the amplification ...
Networking @Scale Summer 2022

Tackling DC Congestion and Bursts | Balasubramanian Madhavan and Abhishek Dhamija

A talk about two specific DC transport tuning initiatives: (a) handling sustained congestion in the network and (b) tackling bursts in the network. Covers the motivation, implementation overview, wins, and lessons learnt for both initiatives.
Networking @Scale Summer 2022

NetEdit: Fine-grained Network Tuning at Scale | Prashanth Kannan and Prankur Gupta

We will share the design, implementation, and production experience of a BPF-based platform used to tune the network transport across millions of servers at Meta.
Networking @Scale Summer 2022

LIVE Q&A | Moderated by Neil Spring

LIVE Q&A featuring Prashanth Kannan, Balasubramanian Madhavan, Abhishek Dhamija, Prankur Gupta & Kumar Saurabh Arora
Networking @Scale Summer 2022

NATless IPv6/IPv4 Address Translation | Keerti Lakshminarayan and Alok Tiagi

We will demonstrate a performant and novel approach to performing NAT, that uses a unique transition mechanism utilizing a new flag introduced to the seccomp() system call, to intercept egress connect calls to opportunistically use a ...
Networking @Scale Summer 2022

Network Entitlement: From Hose-based Approval to Host-based Admission | Guanqing Yan and Manikandan Somasundaram

The Wide Area Network (WAN) connects many datacenter (DC) regions and hundreds of Points of Presence (POPs) of Meta. The WAN resource is shared by several high network demand services at Meta. The network must be built for peak demand ...
Networking @Scale Summer 2022

LIVE Q&A | Moderated by Ying Zhang

LIVE Q&A featuring Keerti Lakshminarayan, Alok Tiagi, Guanqing Yan, Manikandan Somasundaram & Jitu Padhye
Data @Scale Spring 2022

Automated Model Update & Evaluation

This talk breaks down stage-by-stage requirements and challenges for online prediction and fully automated, on-demand continual learning. We’ll also discuss key design decisions a company might face when building or adopting a machine ...
Data @Scale Spring 2022

Real-Time Data Processing for ML Feature Engineering

At Meta, we have developed multiple real-time data processing systems, such as Puma, Stylus, and Turbine (SIGMOD ’16 and ICDE ’20). As Meta grows, the need for real-time data has grown well beyond traditional data ...
Data @Scale Spring 2022

Scalable Data Transportation & Ingestion with MemQ

Machine learning is at the heart of Pinterest and is powered by large-scale ML training log collection. To solve the problem of cost-efficient data ingestion and transportation at Pinterest, we developed MemQ, a PubSub system that ...
Data @Scale Spring 2022

ML Monitoring & Observability @Meta Scale

ML generates significant value for Meta’s infrastructure, tools, products, and users. It drives a varied set of insights; from end-user products such as recommendations and feeds on Facebook and Instagram, to infrastructure insights ...
Data @Scale Spring 2022

Enabling Machine Learning through Real-Time Data Processing using Rockset

Data infrastructure has evolved in the last 15 years from Hadoop’s batch system, to streaming systems like Spark and Kafka, and now to real-time systems like Rockset and ClickHouse. Automatic decision making based on massive data ...
Data @Scale Spring 2022

TorchData and TorchArrow: Data Preprocessing for ML at Production Scale

The problem of deep learning and building large scale systems for production is not just one of model training, but data preprocessing as well. At production scale, just the data loading and processing part of the system can cause ...
Data @Scale Spring 2022

Making Data Quality an integral part of developing Machine Learning and Data Products

Machine Learning models are only as good as the data that was used to train them. Datasets are often plagued with problems such as quality, discoverability, and undesirable social biases. As data and modeling tools are becoming ...
Data @Scale Spring 2022

Minimize Risks and Accelerate MLOps with Model Performance Monitoring and Explainability

We’re truly living under the rule of algorithms: day-to-day activities, from news consumption to job search and mortgage financing, are increasingly being decided by algorithms. Most of these algorithms are AI-based and are ...
Systems @Scale Spring 2022

Q&A | Moderated by Francois Richard. Featuring Yuri Grinshteyn, Jie Huang, Christopher Bunn, Osama Abuelsorour, Amr Mahdi, Jason Flinn & Arushi Aggarwal

Systems @Scale Spring 2022

Owl | Arushi Aggarwal & Jason Flinn

We will describe Owl, a new system for high-fanout distribution of large data objects to hosts in Meta’s private cloud. Owl distributes over 700 petabytes of data per day to millions of client processes. It has improved download ...
Systems @Scale Spring 2022

Southpaw: Token-based service load balancing, scaling and QoS system | Osama Abuelsorour & Amr Mahdi

Southpaw is a load balancing, scaling, and QoS management system for compute-heavy inferencing services. It takes the approach of abstracting service capabilities into tokens and worklanes, where clients are granted tokens that give ...
Systems @Scale Spring 2022

Vacuum Testing for Resiliency: Verifying Disaster Recovery in Complex | Jie Huang & Christopher Bunn

Engineers at Meta run thousands of services across millions of machines, and those services all have similar needs that can’t be managed by hand: configuration, deployment, monitoring, routing, orchestration, security. To solve the ...
Systems @Scale Spring 2022

Shrinking the Impact of Production Incidents | Yuri Grinshteyn

Shrinking Production Incidents details an organized approach for reducing the overall impact of production outages. Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem ...
Systems @Scale Spring 2022

Highly Available and Strongly Consistent Storage Service Using Chain Replication | Kumar Mrinal & Binbin Lu

In this talk, we present Dumbo, a simple, reliable, highly available, low dependency object storage system ...
Systems @Scale Spring 2022

ACS: De-Identified Authentication at Scale | Shiv Kushwah & Haozhi Xiong

Privacy is core to Meta engineering culture, and one of our fundamental principles is data minimization. We strive to collect and create the minimum amount of data required to provide service. One critical space we’ve identified across ...
Systems @Scale Spring 2022

The Cosmos Big Data Platform at Microsoft: Over a Decade of Progress and a Decade to Look Forward | Ivan Santa Maria Filho

Cosmos is the exabyte-scale big data platform at Microsoft, and SCOPE is its main analytics engine. SCOPE and Cosmos support ETL pipelines, decision support systems, and machine learning pipelines. Applications range from simple ...
Systems @Scale Spring 2022

Scaling Data Ingestion for ML Training at Meta | Aarti Basant

AI models drive several Meta products, such as News Feed, Ads, IG Reels, and language translation, to name a few. Our ranking models consume massive datasets to continuously improve user experience on our platform. In this talk, we discuss our ...
Systems @Scale Spring 2022

Cassandra@Scale: A Deep Dive into Apache Cassandra 4.0 | Dinesh Joshi

After almost two years in the making, Apache Cassandra 4.0 is here. With a focus on performance and stability, it is full of interesting features. This talk takes you on a tour of the new features and performance improvements. From ...
Systems @Scale Spring 2022

Manifold: Storage Platform Consolidation | Jacob Lacouture

Since 2016, we’ve built, deployed, and scaled a new BLOB storage platform at Meta, called Manifold. Manifold builds on existing BLOB storage infrastructure, but provides a richer, higher-level, general purpose API, and thereby enables ...
Systems @Scale Spring 2022

Transparent Memory Offloading @Meta | Niket Agarwal, Dan Schatzberg, Johannes Weiner

The unrelenting growth of the memory needs of emerging data center applications, along with ever-increasing cost and volatility of DRAM prices, has led to DRAM being a major infrastructure expense. Alternative technologies, such as ...
RTC @Scale 2022

Real-time Communication for Today and Future Experiences – Maher Saba

RTC @Scale 2022

Holographic Video Calling – Nitin Garg

During COVID, the importance of video calling grew as people were stuck and separated from their family and friends. But the 2D experience falls short of making you feel present in the same ...
RTC @Scale 2022

Spatial Communications at Scale in Virtual Environments – Paul Boustead

Talking with a group of friends face-to-face can be very engaging, with fast-paced turn-taking and overlapping conversations. You can even have such a ...
RTC @Scale 2022

RTC3 – Justin Uberti

The real-time communications industry has evolved rapidly since the release of Skype in 2003, and saw unprecedented growth during the COVID-19 pandemic. This talk will look at the trends of the last 20 years ...
RTC @Scale 2022

RTC @Scale, Future RTC Experiences – Live Q&A

RTC @Scale, Future RTC Experiences Session – Live Q&A with Nitin Garg, Paul Boustead, Justin Uberti, and Rahul Gowda
RTC @Scale 2022

Developing Machine Learning Based Speech Enhancement Models for Teams and Skype – Ross Cutler

Microsoft Teams and Skype are used daily by hundreds of millions of users, and their usage has increased significantly since the ...
RTC @Scale 2022

Can AI Disrupt Speech Compression? – Jan Skoglund

AI and deep learning have radically advanced many speech and audio processing applications. For example, we have all experienced improvements in speech recognition and synthesis in ...
RTC @Scale 2022

RTC @Scale, Audio ML – Live Q&A

RTC @Scale, Audio ML Session – Live Q&A with Ross Cutler and Jan Skoglund
RTC @Scale 2022

RTC @Scale Live Panel – RTC in the Metaverse

RTC @Scale Live Panel with Sriram Srinivasan, Mike Arcuri, Paul Boustead, and Cullen Jennings.
RTC @Scale 2022

AV1 Encoder for RTC – Marco Paniconi

In this presentation we discuss the various features and techniques that make the libaom AV1 encoder suitable for RTC applications: from encoding tool selection to reducing complexity of the ...
RTC @Scale 2022

AV1 for RTC: Current and Future – Zoe Liu

In this talk, we will mainly focus on the state-of-the-art AV1 software encoding capability for its deployment in RTC use cases, taking our Aurora1 AV1 as an instantiation. RTC in essence ...
RTC @Scale 2022

RTC @Scale, Video – Live Q&A

RTC @Scale, Video Session – Live Q&A with Marco Paniconi and Zoe Liu.

