Systems @Scale Winter 2021

DECEMBER 08, 2021 @ 7:30 AM PST - 8:30 AM PST

DECEMBER 15, 2021 @ 7:30 AM PST - 9:00 AM PST

Designed for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.

ABOUT EVENT

Systems @Scale is a technical conference for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The 2021 Winter series will be hosted virtually. Joining us are speakers from NVIDIA and Meta (Facebook). The event spans two weeks, with talks themed around efficiency and reliability in the context of large-scale distributed systems.

For two weeks starting December 8, we will livestream recorded sessions each Wednesday, followed by a live panel discussion.

EVENT AGENDA

Event times below are displayed in PT.

December 8

December 15

07:30 AM - 07:50 AM

LogDevice At Scale

Meta uses a strongly consistent distributed log storage system to broadcast updates in graphs, deliver signals to ML training pipelines, and collect data for analytics. All of these use cases require the underlying log system to be highly available, especially on the write side, since we have no other place to store generated data. This talk will cover optimizations to the consensus algorithm we use that are required at Meta's scale to make its systems even more reliable in the presence of hardware maintenance and organic failures.

Speaker Miroslav Crnic, Meta

Speaker Nick Sukhanov, Meta

07:50 AM - 08:10 AM

Scheduled Deletions At Scale

At Meta, a large part of our data is ephemeral in nature, such as Instagram or Meta Stories, which need to be deleted after a specific time regardless of any action taken by the user. This is sometimes referred to as Time to Live (TTL). Deletions may be registered to start at an arbitrary point in the future. Developers can arrange this at object creation time by creating an object of a type that is under TTL. The Deletion Infrastructure team at Meta built the Schedule Deletion (SD) Infrastructure, which allows us to schedule and track massive volumes of deletions from days to years in the future.
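The idea of registering a deletion at object creation time can be sketched with a small in-memory scheduler. This is only an illustration of the TTL concept under stated assumptions: the class and method names (`DeletionScheduler`, `schedule`, `due`) are hypothetical, a single-process min-heap stands in for whatever durable, distributed storage the real infrastructure uses, and nothing here reflects how Meta's SD system is actually built.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledDeletion:
    # Only delete_at takes part in ordering, so the heap is keyed by due time.
    delete_at: float
    object_id: str = field(compare=False)


class DeletionScheduler:
    """Hypothetical in-memory scheduler: earliest deletion is always on top."""

    def __init__(self):
        self._heap = []

    def schedule(self, object_id, created_at, ttl_seconds):
        # The deletion is registered when the object is created; its due
        # time is fixed regardless of what the user does afterwards.
        entry = ScheduledDeletion(created_at + ttl_seconds, object_id)
        heapq.heappush(self._heap, entry)

    def due(self, now):
        # Pop every deletion whose due time has passed, earliest first.
        ready = []
        while self._heap and self._heap[0].delete_at <= now:
            ready.append(heapq.heappop(self._heap).object_id)
        return ready


scheduler = DeletionScheduler()
scheduler.schedule("story-1", created_at=0, ttl_seconds=24 * 3600)  # one day
scheduler.schedule("note-2", created_at=0, ttl_seconds=10)
print(scheduler.due(now=10))         # only the 10-second object is due
print(scheduler.due(now=24 * 3600))  # the one-day object is due now
```

A heap keeps the next due deletion cheap to find, which is the property a scheduler needs when due times range from days to years.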

Speaker Sneha Padgalwar, Meta

08:10 AM - 08:30 AM

Live Q&A Session

Live Q&A Session with Miroslav Crnic, Nick Sukhanov & Sneha Padgalwar. Hosted by Sajal Jain.

Speaker Sajal Jain, Meta

December 15

07:30 AM - 07:50 AM

SLICK: Driving SLO Culture At Meta

SLIs (Service Level Indicators) and SLOs (Service Level Objectives) are industry-standard concepts to measure the long-term reliability of systems. In this presentation, we are going to talk about SLICK, the central SLO tracking platform at Meta. We’ll highlight the challenges with reliability tracking at the company historically, and how SLICK is solving these. Afterwards, we are going to walk through SLICK’s high-level architecture, with a focus on how the major pieces are connected with the user workflows. We’ll also talk about how SLICK has introduced SLOs into the culture at Meta, and how our services have improved their reliability as a result.
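As a rough illustration of the SLI and SLO concepts named in this abstract, the sketch below computes an availability SLI and the share of the error budget that remains for a given SLO target. It is not SLICK's implementation, which the abstract does not describe; the function names and the request counts are hypothetical.

```python
def sli(good_events, total_events):
    """Service Level Indicator: fraction of events that met the criterion."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget still unspent (negative if overspent).

    The error budget is the number of bad events the SLO tolerates:
    (1 - slo_target) * total_events.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO.
good, total, target = 999_500, 1_000_000, 0.999
print(f"SLI: {sli(good, total):.4%}")
print(f"Error budget remaining: {error_budget_remaining(good, total, target):.0%}")
```

Tracking the remaining budget, rather than the raw SLI alone, shows how much unreliability a service can still absorb before it misses its objective.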

Speaker Dávid Bartók, Meta

Speaker Filip Klepo, Meta

07:50 AM - 08:10 AM

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

In this talk we present how we trained a 530B parameter language model on a DGX SuperPOD with over 3,000 A100 GPUs and a high-speed InfiniBand interconnect, and how we can scale to even larger models. We explore three types of parallelism (data, tensor, and pipeline) and how they can be composed to achieve maximum efficiency. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak). We discuss challenges that we faced when training the 530B Megatron-Turing NLG model and give practical advice on how to successfully train very large language models.
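The throughput figures in this abstract can be checked with a few lines of arithmetic. The sketch below assumes a theoretical peak of 312 teraFLOP/s per A100 (the dense FP16/BF16 tensor core figure), which the abstract does not state. The tensor- and pipeline-parallel sizes in the second part are illustrative values chosen only to show how the three kinds of parallelism multiply out to the GPU count; they are not taken from the abstract.

```python
AGGREGATE_FLOPS = 502e15   # 502 petaFLOP/s, from the abstract
NUM_GPUS = 3072            # from the abstract
A100_PEAK_FLOPS = 312e12   # assumption: dense FP16/BF16 tensor core peak


def per_gpu_flops(aggregate_flops, num_gpus):
    """Average achieved throughput per GPU."""
    return aggregate_flops / num_gpus


def data_parallel_degree(num_gpus, tensor_parallel, pipeline_parallel):
    """Data-parallel replicas left once each model copy has its GPUs.

    Total GPUs = data_parallel * tensor_parallel * pipeline_parallel.
    """
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if num_gpus % gpus_per_replica != 0:
        raise ValueError("GPU count must be divisible by tensor * pipeline size")
    return num_gpus // gpus_per_replica


per_gpu = per_gpu_flops(AGGREGATE_FLOPS, NUM_GPUS)
utilization = per_gpu / A100_PEAK_FLOPS
print(f"{per_gpu / 1e12:.1f} teraFLOP/s per GPU, {utilization:.0%} of peak")

# Illustrative split (assumed): 8-way tensor and 64-way pipeline parallelism.
replicas = data_parallel_degree(NUM_GPUS, tensor_parallel=8, pipeline_parallel=64)
print(f"{replicas} data-parallel replicas")
```

Under the assumed peak, the per-GPU figure works out to roughly 163 teraFLOP/s, consistent with the 52% quoted in the abstract.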

Speaker Jared Casper, NVIDIA

08:10 AM - 08:30 AM

Software and Hardware Remediations At Meta

Efficient software and hardware failure remediations are the foundation for sustaining high fleet availability in large-scale environments such as Meta's. In this talk, we will describe the general architecture that we use to maximize fleet availability by identifying and remediating software issues as well as faulty assets that might require manual intervention from a datacenter technician. We’ll start by presenting FBAR (Facebook Auto-Remediation), the platform used by our main customers (IG, Messenger, etc.) for writing event-based software remediation workflows that are executed to mitigate software issues. We’ll then turn to what happens when issues are instead hardware related: how we detect those faults at scale, and how we empowered our repair workflow with a decision engine called RepairBrain, which helps identify and track which repair actions might speed up mitigation and provides a platform for deploying custom repair actions.

Speaker Antonio Davoli, Meta

Speaker Leandro Silva, Meta

08:30 AM - 09:00 AM

Live Q&A Session

Live Q&A Session with Dávid Bartók, Filip Klepo, Jared Casper, Antonio Davoli & Leandro Silva. Hosted by Ahmad Mamdouh Abdou.

Speaker Ahmad Mamdouh Abdou, Meta

SPEAKERS AND MODERATORS

Miroslav Crnic, Meta

Nick Sukhanov, Meta

Sneha Padgalwar, Meta

Sajal Jain, Meta

Dávid Bartók, Meta

Filip Klepo, Meta

Jared Casper, NVIDIA

Antonio Davoli, Meta

Leandro Silva, Meta

Ahmad Mamdouh Abdou, Meta

LATEST NOTES

@Scale engineers write blogs, articles, and academic papers to further inform and inspire the engineering community.

Systems & Reliability @Scale

12/08/2021

Software and Hardware Remediations at Meta

Meta large-scale deployment needs to support billions of customers who rely every day on our family of apps (Facebook, WhatsApp,...
