TOPIC: Data, Systems and Networking

Systems @Scale Winter 2021

DECEMBER 08, 2021 @ 7:30 AM PST - 8:30 AM PST
DECEMBER 15, 2021 @ 7:30 AM PST - 9:00 AM PST
Designed for engineers that manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.

ABOUT EVENT

Systems @Scale is a technical conference for engineers that manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

The 2021 Winter series will be hosted virtually. Joining us are speakers from NVIDIA and Meta (Facebook). The event spans two weeks, with talks themed around efficiency and reliability in the context of large-scale distributed systems.

Starting December 8, we will livestream recorded sessions followed by a live panel discussion each Wednesday for two weeks.

EVENT AGENDA

Event times below are displayed in PT.

December 8

December 15

07:30 AM - 07:50 AM
LogDevice At Scale

Meta uses a strongly consistent distributed log storage system to broadcast updates in graphs, deliver signals to ML training pipelines, and collect data for analytics. All of these cases require the underlying log system to be highly available, especially on the write side since we don't have any other place to store generated data. This talk will cover some optimizations in the consensus algorithm we are using that are required at Meta's scale to make its systems even more reliable in the presence of hardware maintenance and organic failures.

Speaker Miroslav Crnic, Meta
Speaker Nick Sukhanov, Meta
07:50 AM - 08:10 AM
Scheduled Deletions At Scale

At Meta, a large part of our data is ephemeral in nature, such as Instagram or Meta Stories, which need to be deleted after a specific time regardless of the action taken by the user. This is sometimes referred to as Time to Live (TTL). Deletions may be registered to start at an arbitrary point in the future. Developers can achieve this at object creation time by creating an object of a type that is under TTL. The Deletion Infrastructure team at Meta built the Schedule Deletion (SD) Infrastructure, which allows us to schedule and track massive volumes of deletions from days to years in the future.

Speaker Sneha Padgalwar, Meta
08:10 AM - 08:30 AM
Live Q&A Session

Live Q&A Session with Miroslav Crnic, Nick Sukhanov & Sneha Padgalwar. Hosted by Sajal Jain.

Speaker Sajal Jain, Meta

December 15

07:30 AM - 07:50 AM
SLICK: Driving SLO Culture At Meta

SLIs (Service Level Indicators) and SLOs (Service Level Objectives) are industry-standard concepts to measure the long-term reliability of systems. In this presentation, we are going to talk about SLICK, the central SLO tracking platform at Meta. We’ll highlight the challenges with reliability tracking at the company historically, and how SLICK is solving these. Afterwards, we are going to walk through SLICK’s high-level architecture, with a focus on how the major pieces are connected with the user workflows. We’ll also talk about how SLICK has introduced SLOs into the culture at Meta, and how our services have improved their reliability as a result.

Speaker Dávid Bartók, Meta
Speaker Filip Klepo, Meta
07:50 AM - 08:10 AM
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

In this talk, we present how we trained a 530B parameter language model on a DGX SuperPOD with over 3,000 A100 GPUs and a high-speed InfiniBand interconnect, and how we can scale to even larger models. We explore three types of parallelism (data, tensor, and pipeline) and how they can be composed to achieve maximum efficiency. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak). We discuss challenges that we faced when training the 530B Megatron-Turing NLG model and give practical advice on how to successfully train very large language models.

Speaker Jared Casper, NVIDIA
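
The throughput figures in the Megatron-LM abstract above can be cross-checked with a little arithmetic. The sketch below divides the quoted aggregate throughput by the quoted GPU count and compares the result with a per-GPU peak. The peak value of 312 teraFLOP/s is an assumption that is not stated in the abstract: it is NVIDIA's published dense FP16/BF16 Tensor Core peak for the A100. The function names are illustrative only.

```python
# Figures quoted in the abstract.
AGGREGATE_PFLOPS = 502   # aggregate training throughput, petaFLOP/s
NUM_GPUS = 3072          # number of A100 GPUs

# Assumption (not in the abstract): A100 dense FP16/BF16 Tensor Core
# peak throughput, in teraFLOP/s.
ASSUMED_A100_PEAK_TFLOPS = 312


def per_gpu_tflops(aggregate_pflops: float, num_gpus: int) -> float:
    """Convert aggregate petaFLOP/s into teraFLOP/s per GPU."""
    return aggregate_pflops * 1000 / num_gpus


def fraction_of_peak(achieved_tflops: float, peak_tflops: float) -> float:
    """Return achieved throughput as a fraction of the peak."""
    return achieved_tflops / peak_tflops


per_gpu = per_gpu_tflops(AGGREGATE_PFLOPS, NUM_GPUS)
utilization = fraction_of_peak(per_gpu, ASSUMED_A100_PEAK_TFLOPS)

print(f"{per_gpu:.1f} teraFLOP/s per GPU")    # 163.4 teraFLOP/s per GPU
print(f"{utilization:.1%} of assumed peak")   # 52.4% of assumed peak
```

Under that assumed peak, the result is consistent with the 52% figure quoted in the abstract.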
08:10 AM - 08:30 AM
Software and Hardware Remediations At Meta

Efficient software and hardware failure remediations are the foundation for sustaining high fleet availability in large-scale environments such as Meta. In this talk, we will describe the general architecture that we use to maximize fleet availability by identifying and remediating software issues as well as faulty assets that might require manual intervention from a datacenter technician. We’ll start by presenting FBAR (Facebook Auto-Remediation), the platform used by our main customers (IG, Messenger, etc.) for writing event-based software remediation workflows that are executed to mitigate software issues. We’ll then focus on what happens when issues are instead hardware related, how we detect those faults at scale, and how we empowered our repair workflow with a decision engine called RepairBrain, which helps identify and track which repair actions might speed up the mitigation and provides a platform for deploying custom repair actions.

Speaker Antonio Davoli, Meta
Speaker Leandro Silva, Meta
08:30 AM - 09:00 AM
Live Q&A Session

Live Q&A Session with Dávid Bartók, Filip Klepo, Jared Casper, Antonio Davoli & Leandro Silva. Hosted by Ahmad Mamdouh Abdou.

Speaker Ahmad Mamdouh Abdou, Meta

SPEAKERS AND MODERATORS

Miroslav Crnic, Meta
Nick Sukhanov, Meta
Sneha Padgalwar, Meta
Sajal Jain, Meta
Dávid Bartók, Meta
Filip Klepo, Meta
Jared Casper, NVIDIA
Antonio Davoli, Meta
Leandro Silva, Meta
Ahmad Mamdouh Abdou, Meta

LATEST NOTES

@Scale engineers pencil blogs, articles, and academic papers to further inform and inspire the engineering community.

Systems @Scale
12/08/2021
Software and Hardware Remediations at Meta
Meta large-scale deployment needs to support billions of customers who rely every day on our family of apps (Facebook, WhatsApp,...