TOPIC: Data, Systems and Networking

Systems @Scale 2024

June 12, 2024

Systems @Scale 2024 is a technical conference for engineers who build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduce complex, unprecedented engineering challenges. The @Scale community focuses on bringing forward people's experiences in creating innovative solutions.

On June 12, Systems @Scale will bring together speakers from Bytedance, Lepton AI, Meta, NVIDIA and Snowflake to discuss infrastructure support for AI, system efficiency and reliability, distributed system abstractions, hyperscale orchestration, and more.

In-person registration is now closed. Virtual registration is still open and will remain open through event day.

EVENT AGENDA

Event times below are displayed in PT.

June 12

10:00 AM - 10:05 AM
Opening Remarks
Speaker Alex Boyko, Meta
10:05 AM - 10:25 AM
Keynote

The AI revolution has created an exciting period of innovation for Infrastructure people. It’s a time when new methodologies and system architectures are being formed. In this keynote, Surupa Biswas, VP Engineering at Meta, covers Meta’s in-progress journey evolving its core infrastructure systems at an unprecedented pace in support of AI.

Speaker Surupa Biswas, Meta
10:25 AM - 10:45 AM
Building at Scale with H100: Eos as a DGX SuperPOD Reference Model for Large Data Center Builds

With language models getting larger, compute infrastructure needs to handle both reliability and performance at unprecedented scales. In addition to having a large number of GPUs working together, the platform needs to provide guarantees on fabric and IO performance and stability, and also ensure that software is architected to enable consistency and reliability across workload launching, job scheduling, and monitoring. In this talk, we will describe how Eos was built to leverage an H100 reference cluster architecture.

Speaker Julie Bernauer, NVIDIA
10:45 AM - 11:05 AM
Maintaining Large Scale AI Capacity @Meta

In just two years, Meta has undergone a monumental transformation in its AI infrastructure, transitioning from a single research cluster to a sprawling network of nearly a hundred AI super clusters of varying sizes with hundreds of thousands of GPUs. This rapid expansion has introduced a myriad of challenges, ranging from managing diverse hardware configurations to optimizing resource allocations. As part of this, we scaled our infrastructure to safely and predictably perform maintenance without disrupting training jobs. As an example, our teams worked with teams at NVIDIA to create advancements in the GPU stack in the form of deep health checks, allowing us to roll out critical upgrades continuously.

During this talk, we will share insights into the key areas that demanded our attention and the solutions we implemented to address them. From implementing rolling updates for kernel drivers and firmware to leveraging redundancies and failover mechanisms, we will explore the technical intricacies involved in sustaining our AI infrastructure while conducting essential maintenance tasks. Furthermore, we will discuss the role of automation and orchestration in streamlining maintenance operations, minimizing downtime, and optimizing resource utilization.

Speaker Saranyan Vigraham, Meta
Speaker Benjamin Leonhardi, Meta
11:05 AM - 11:20 AM
Systems @Scale Live Q&A #1
Speaker Julie Bernauer, NVIDIA
Speaker Saranyan Vigraham, Meta
Speaker Benjamin Leonhardi, Meta
Moderator Alex Boyko, Meta
11:20 AM - 11:40 AM
Break
11:40 AM - 12:00 PM
Training LLaMa: A Storage Perspective

GenAI training needs flipped the script on all of our assumptions around "storage at scale". This is the story of the trials and tribulations that ultimately led to the successful launch of our largest-scale LLaMA training jobs, from a storage perspective.

Speaker Robin Battey, Meta
Speaker Sumit Gupta, Meta
12:00 PM - 12:20 PM
Training Arctic at Snowflake

In this case study, we present the system used to train the Arctic MoE model at Snowflake. The system uses a combination of Snowflake and Kubernetes for the entire lifecycle of Large Language Model (LLM) training, ranging from the initial stages of data acquisition and processing—including annotation, filtering, and deduplication—to conducting data ablation experiments and executing large-scale model training. Our approach leverages Snowflake for its robust data governance, lineage tracking, and cloud warehouse capabilities, alongside the versatile CPU and GPU compute resources orchestrated through Kubernetes. This symbiosis not only streamlines the model development process but also enhances efficiency and scalability by optimizing resource allocation and utilization: a cluster of GPU nodes and a Snowflake instance is all you need to do model training from scratch. Through this unified framework, we demonstrate a seamless, end-to-end solution that accelerates LLM training workflows, ensuring both high performance and adherence to data governance standards.

Live remarks will be presented by Jeff Rasley and Lawrence Moore. The post-event video on demand will feature Jeff Rasley and Hyungtae Kim.

Speaker Lawrence Moore, Snowflake
Speaker Jeff Rasley, Snowflake
Speaker Hyungtae Kim, Snowflake
12:20 PM - 12:35 PM
Systems @Scale Live Q&A #2
Speaker Robin Battey, Meta
Speaker Sumit Gupta, Meta
Speaker Lawrence Moore, Snowflake
Speaker Jeff Rasley, Snowflake
Moderator Alex Boyko, Meta
12:35 PM - 01:40 PM
Lunch Break
01:40 PM - 02:00 PM
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

In this presentation, I will discuss the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long duration of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to addressing them. We developed a set of diagnostic tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. We share our operational experience in identifying and fixing failures and stragglers.

Speaker Haibin Lin, Bytedance
02:00 PM - 02:15 PM
AI Training Orchestration Evolution with Serverless Building Blocks

Join us as we talk about the evolution of workflow orchestration leading to the creation of composable serverless subsystems. We further discuss how Fblearner, an AI development platform, leveraged this ecosystem of building blocks to address persistent challenges like orchestration-execution coupling, inefficient resource use, and poor debugging experiences. We will also delve into the complexities of updating a business-critical system with strict SLA guarantees at Meta scale.

Speaker Maneet Bansal, Meta
Speaker Upasana Dixit, Meta
Speaker Shawn Wang, Meta
Featured Article
Evolution of AI Training Orchestration with Serverless Ecosystem
02:15 PM - 02:30 PM
Evolving Cluster Management

We will talk about the next evolution of cluster management, specifically focusing on up-leveled paradigms and how they have improved integration with higher-level systems and reduced operational complexity.

Speaker Shankar Selvam, Meta
Speaker Cedric Goh, Meta
Featured Article
Evolving Cluster Management: Upleveling Abstractions
02:30 PM - 02:45 PM
Systems @Scale Live Q&A #3
Speaker Haibin Lin, Bytedance
Speaker Maneet Bansal, Meta
Speaker Upasana Dixit, Meta
Speaker Shawn Wang, Meta
Speaker Shankar Selvam, Meta
Speaker Cedric Goh, Meta
Moderator Maria Barra, Meta
02:45 PM - 03:05 PM
Break
03:05 PM - 03:25 PM
Scalable Solutions for Running Large Language Models

The advent of open-source large language models like Llama and Mixtral demands innovative deployment strategies for efficiency and cost-effectiveness. We will explore adaptive workload management for infrastructure optimization, crucial for handling varying demands efficiently. Next, we will delve into LLM caching techniques, including sticky routing and prompt caching, to enhance response times and optimize system utilization. Additionally, we'll discuss strategies designed to mitigate system pressure during spikes in traffic. These strategies collectively aim to enhance the scalability and efficiency of AI platforms in the era of advanced LLMs.

Speaker Jiaxin Cao, Lepton AI
03:25 PM - 03:45 PM
GenAI Training in Production: Software, Hardware & Network Considerations

The impact of GenAI on Infrastructure has been swift and profound across the industry. In this talk we will outline how Meta built GenAI infrastructure and discuss the challenges and tradeoffs made across hardware, network and software while maintaining operations at scale. We will also discuss some lessons learned along the way and opportunities that lie ahead.

Speaker Adi Gangidi, Meta
Speaker Jenya Lee, Meta
Speaker KR Kishore, Meta
03:45 PM - 04:05 PM
Systems @Scale Live Q&A #4
Speaker Jiaxin Cao, Lepton AI
Speaker Adi Gangidi, Meta
Speaker Jenya Lee, Meta
Speaker KR Kishore, Meta
Moderator Maria Barra, Meta
04:05 PM - 04:10 PM
Closing Remarks
Moderator Alex Boyko, Meta

SPEAKERS AND MODERATORS

Alex Boyko, Meta
Alex is an engineering leader at Core Systems, where he supports teams responsible for...

Surupa Biswas, Meta
Surupa Biswas is the Vice President of Engineering responsible for Core Systems at Meta....

Julie Bernauer, NVIDIA
As senior director for Data Center Systems Engineering at NVIDIA, Julie leads a team...

Saranyan Vigraham, Meta
Saranyan Vigraham is a seasoned leader at Meta, where he heads the Data Center...

Benjamin Leonhardi, Meta
Benjamin Leonhardi leads a wide array of problems in the space of AI maintenance...

Robin Battey, Meta
Robin has been designing and building distributed systems since 2003, and for Meta since...

Sumit Gupta, Meta
Sumit has been in the storage industry for 30 years and has worked at...

Lawrence Moore, Snowflake
Lawrence is a senior software engineer working on enabling LLMs for the enterprise. He...

Jeff Rasley, Snowflake
Jeff Rasley is a Senior Engineer in the Snowflake AI Research Team working on...

Hyungtae Kim, Snowflake
Hyungtae is a principal software engineer on the Arctic research team at Snowflake. He...

Haibin Lin, Bytedance
Haibin is a research scientist and engineering leader at Bytedance, working on machine learning...

Maneet Bansal, Meta
Maneet Bansal is a seasoned software engineer on the Core Systems Team at Meta,...

Upasana Dixit, Meta
Upasana is a Software Engineer in the Serverless Orchestration team at Meta, where she...

Shawn Wang, Meta
Shawn Wang is a software engineer with years of experience in building large-scale distributed...

Shankar Selvam, Meta
Shankar has been working as a Software Engineer on Meta’s cluster management and container...

Cedric Goh, Meta
Cedric Goh is a Software Engineer in the Core Systems team, working on Twine,...

Maria Barra, Meta
Maria has been a Technical Program Manager within Meta's extensive Infrastructure Organization for over...

Jiaxin Cao, Lepton AI
Machine learning expert. Previously worked for Bytedance & Alibaba. Focus on LLM/Diffusion.

Adi Gangidi, Meta
At Meta, I lead RDMA Network design and deployments for AI workloads. Before this,...

Jenya Lee, Meta
Jenya Lee is a Production Engineer at Meta, playing a pivotal role in the...

KR Kishore, Meta
Kishore has been a Hardware Systems Engineer at Meta for the past 4+ years...

LATEST NOTES

Systems @Scale
06/10/2024
Evolving Cluster Management: Upleveling Abstractions 
At Meta, our vast infrastructure spans over 20 data center regions and comprises millions of machines, all of which work...
Systems @Scale
06/11/2024
Evolution of AI Training Orchestration with Serverless Ecosystem
Introduction The past couple of years have been nothing short of extraordinary for technology, especially for artificial intelligence (AI). Amidst...
