Systems @Scale Winter 2021

Virtual 7:30am - 9:30am

Event Completed

Systems @Scale is a technical conference for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The 2021 Winter series will be hosted virtually. Joining us are speakers from NVIDIA and Meta (Facebook). The event spans two weeks, with talks themed around efficiency and reliability in the context of large-scale distributed systems.

Each Wednesday for two weeks, starting December 8th, we will livestream a recorded session followed by a live panel discussion.

Agenda
7:30am - 7:50am

Wednesday, December 8 — LogDevice At Scale

Meta uses a strongly consistent distributed log storage system to broadcast updates in graphs, deliver signals to ML training pipelines, and collect data for analytics. All of these use cases require the underlying log system to be highly available, especially on the write side, since we don't have any other place to store generated data. This talk will cover some optimizations to the consensus algorithm we use. These optimizations are required at Meta's scale to keep the system reliable in the presence of hardware maintenance and organic failures.
7:50am - 8:10am

Wednesday, December 8 — Scheduled Deletions At Scale

At Meta, a large part of our data is ephemeral in nature, such as Instagram or Meta Stories, which need to be deleted after a specific time regardless of any action taken by the user. This is sometimes referred to as Time to Live (TTL). Deletions may be registered to start at an arbitrary point in the future. Developers can achieve this at object creation time by creating an object of a type that is under TTL. The Deletion Infrastructure team at Meta built the Schedule Deletion (SD) Infrastructure, which allows us to schedule and track massive volumes of deletions from days to years in the future.
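The idea of registering a deletion at object creation time can be illustrated with a toy in-memory scheduler. This is only a sketch under stated assumptions: the `DeletionScheduler` class, its method names, and the min-heap design are hypothetical and are not taken from the talk or from Meta's SD Infrastructure, which operates at a very different scale.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledDeletion:
    delete_at: float                        # Unix timestamp at which the object expires
    object_id: str = field(compare=False)   # hypothetical object identifier


class DeletionScheduler:
    """Toy scheduler (hypothetical): a min-heap ordered by expiry time."""

    def __init__(self):
        self._heap = []

    def register(self, object_id, created_at, ttl_seconds):
        # Called at object creation time; the TTL is assumed to come from the object's type.
        heapq.heappush(self._heap, ScheduledDeletion(created_at + ttl_seconds, object_id))

    def pop_due(self, now):
        # Return the ids of all objects whose deletion time has passed, earliest first.
        due = []
        while self._heap and self._heap[0].delete_at <= now:
            due.append(heapq.heappop(self._heap).object_id)
        return due

    def pending(self):
        # Number of deletions still being tracked.
        return len(self._heap)


scheduler = DeletionScheduler()
scheduler.register("story-1", created_at=0, ttl_seconds=24 * 3600)  # expires after a day
scheduler.register("message-1", created_at=0, ttl_seconds=3600)     # expires after an hour
print(scheduler.pop_due(now=3600))
```

A heap keeps the next deletion to run at the front, so checking for due work is cheap even when most registered deletions lie far in the future.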
8:10am - 8:30am

Wednesday, December 8 — Live Q&A Session

Live Q&A Session with Miroslav Crnic, Nick Sukhanov & Sneha Padgalwar. Hosted by Sajal Jain.
7:30am - 7:50am

Wednesday, December 15 — SLICK: Driving SLO Culture At Meta

SLIs (Service Level Indicators) and SLOs (Service Level Objectives) are industry-standard concepts for measuring the long-term reliability of systems. In this presentation, we are going to talk about SLICK, the central SLO tracking platform at Meta. We’ll highlight the historical challenges with reliability tracking at the company, and how SLICK is solving them. Afterwards, we are going to walk through SLICK’s high-level architecture, with a focus on how the major pieces are connected with the user workflows. We’ll also talk about how SLICK has introduced SLOs into the culture at Meta, and how our services have improved their reliability as a result.
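For readers new to the terms, the relationship between an SLI and an SLO can be sketched with a small calculation. The functions below are hypothetical and use the common request-based definition (good events divided by total events). They are not SLICK's API, and the numbers are made up for illustration.

```python
def sli(good_events, total_events):
    """Service Level Indicator: the fraction of events that met the success criterion."""
    return good_events / total_events if total_events else 1.0


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget still unspent for the measured period.

    slo_target is the objective for the SLI, e.g. 0.999 for "99.9% of requests succeed".
    A negative result means the SLO was missed.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# Made-up example: 1,000,000 requests, 500 of which failed, against a 99.9% SLO.
print(sli(999_500, 1_000_000))
print(error_budget_remaining(999_500, 1_000_000, slo_target=0.999))
```

In this example the service used about half of the failures its SLO allows, so about half of the error budget remains.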
7:50am - 8:10am

Wednesday, December 15 — Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

In this talk we present how we trained a 530B parameter language model on a DGX SuperPOD with over 3,000 A100 GPUs and a high-speed InfiniBand interconnect, and how we can scale to even larger models. We explore three types of parallelism (data, tensor, and pipeline) and how they can be composed to achieve maximum efficiency. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak). We discuss challenges that we faced when training the 530B Megatron-Turing NLG model and give practical advice on how to successfully train very large language models.
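The throughput figures in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that "theoretical peak" refers to NVIDIA's published 312 teraFLOP/s dense FP16/BF16 Tensor Core peak for the A100, which the abstract does not state. The tensor- and pipeline-parallel sizes in the last step are illustrative values chosen here to show how the three kinds of parallelism multiply out to the GPU count, not numbers from the abstract.

```python
A100_PEAK_TFLOPS = 312.0  # assumption: dense FP16/BF16 Tensor Core peak per A100


def per_gpu_tflops(aggregate_pflops, num_gpus):
    """Convert aggregate petaFLOP/s into teraFLOP/s per GPU."""
    return aggregate_pflops * 1000.0 / num_gpus


def utilization(aggregate_pflops, num_gpus, peak_tflops=A100_PEAK_TFLOPS):
    """Achieved per-GPU throughput as a fraction of the theoretical peak."""
    return per_gpu_tflops(aggregate_pflops, num_gpus) / peak_tflops


def data_parallel_size(num_gpus, tensor_parallel, pipeline_parallel):
    """Data-parallel replicas left once each model replica's GPUs are accounted for."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if num_gpus % gpus_per_replica != 0:
        raise ValueError("GPU count must be a multiple of tensor * pipeline size")
    return num_gpus // gpus_per_replica


print(round(per_gpu_tflops(502, 3072)))     # teraFLOP/s per GPU
print(round(utilization(502, 3072) * 100))  # percent of the assumed peak
# Illustrative split (assumption): 8-way tensor and 64-way pipeline parallelism.
print(data_parallel_size(3072, tensor_parallel=8, pipeline_parallel=64))
```

Under the assumed peak, 502 petaFLOP/s across 3072 GPUs works out to roughly 163 teraFLOP/s per GPU, which is consistent with the 52% figure quoted above.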
8:10am - 8:30am

Wednesday, December 15 — Software and Hardware Remediations At Meta

Efficient software and hardware failure remediations are the foundation for sustaining high fleet availability in large-scale environments such as Meta. In this talk, we will describe the general architecture that we use to maximize fleet availability by identifying and remediating software issues, as well as faulty assets that might require manual intervention from a datacenter technician. We’ll start by presenting FBAR (Facebook Auto-Remediation), the platform used by our main customers (IG, Messenger, etc.) for writing event-based software remediation workflows that are executed to mitigate software issues. We’ll then focus on what happens when issues are instead hardware related, and how we detect those faults at scale. Finally, we’ll describe how we empowered our repair workflow with a decision engine called RepairBrain, which helps identify and track which repair actions might speed up the mitigation, and provides a platform for deploying custom repair actions.
8:30am - 9:00am

Wednesday, December 15 — Live Q&A Session

Live Q&A Session with Dávid Bartók, Filip Klepo, Jared Casper, Antonio Davoli & Leandro Silva. Hosted by Ahmad Mamdouh Abdou.

