Systems @Scale 2024

June 12, 2024

Systems @Scale 2024 is a technical conference for engineers who build and manage large-scale distributed systems serving millions or billions of users. Developing and operating such systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to share their experiences in creating innovative solutions.

On June 12, Systems @Scale will bring together speakers from Bytedance, Lepton AI, Meta, NVIDIA and Snowflake to discuss infrastructure support for AI, system efficiency and reliability, distributed system abstractions, hyperscale orchestration, and more.

In-person registration is now closed. Registration for joining virtually is still open and will remain open through event day.

EVENT AGENDA

Event times below are displayed in PT.

June 12

10:00 AM - 10:05 AM

Opening Remarks

Speaker Alex Boyko, Meta

10:05 AM - 10:25 AM

Keynote

The AI revolution has created an exciting period of innovation for infrastructure engineers. It’s a time when new methodologies and system architectures are being formed. In this keynote, Surupa Biswas, VP Engineering at Meta, covers Meta’s in-progress journey evolving its core infrastructure systems at an unprecedented pace in support of AI.

Speaker Surupa Biswas, Meta

10:25 AM - 10:45 AM

Building at Scale with H100: Eos as a DGX SuperPOD Reference Model for Large Data Center Builds

With language models getting larger, compute infrastructure needs to deliver both reliability and performance at unprecedented scales. In addition to having a large number of GPUs working together, the platform needs to provide guarantees on fabric and IO performance and stability, and also ensure that software is architected for consistency and reliability across workload launching, job scheduling, and monitoring. In this talk, we will describe how Eos was built to leverage an H100 reference cluster architecture.

Speaker Julie Bernauer, NVIDIA

10:45 AM - 11:05 AM

Maintaining Large Scale AI Capacity @Meta

In just two years, Meta has undergone a monumental transformation in its AI infrastructure, transitioning from a single research cluster to a sprawling network of nearly a hundred AI super clusters of varying sizes with hundreds of thousands of GPUs. This rapid expansion has introduced a myriad of challenges, ranging from managing diverse hardware configurations to optimizing resource allocations. As part of this, we scaled our infrastructure to safely and predictably perform maintenance without disrupting training jobs. As an example, our teams worked with teams at NVIDIA to create advancements in the GPU stack in the form of deep health checks, allowing us to roll out critical upgrades continuously.

During this talk, we will share insights into the key areas that demanded our attention and the solutions we implemented to address them. From implementing rolling updates for kernel drivers and firmware to leveraging redundancies and failover mechanisms, we will explore the technical intricacies involved in sustaining our AI infrastructure while conducting essential maintenance tasks. Furthermore, we will discuss the role of automation and orchestration in streamlining maintenance operations, minimizing downtime, and optimizing resource utilization.

Speaker Saranyan Vigraham, Meta

Speaker Benjamin Leonhardi, Meta

ADDITIONAL RESOURCES

Maintaining Large-Scale AI Capacity at Meta

11:05 AM - 11:20 AM

Systems @Scale Live Q&A #1

Speaker Julie Bernauer, NVIDIA

Speaker Saranyan Vigraham, Meta

Speaker Benjamin Leonhardi, Meta

Moderator Alex Boyko, Meta

11:20 AM - 11:40 AM

Break

11:40 AM - 12:00 PM

Training LLaMa: A Storage Perspective

GenAI training needs flipped the script on all of our assumptions around "storage at scale". This is the story of the trials and tribulations that ultimately led to the successful launch of our largest-scale LLaMA training jobs, from a storage perspective.

Speaker Robin Battey, Meta

Speaker Sumit Gupta, Meta

12:00 PM - 12:20 PM

Training Arctic at Snowflake

In this case study, we present the system used to train the Arctic MoE model at Snowflake. The system uses a combination of Snowflake and Kubernetes for the entire lifecycle of Large Language Model (LLM) training, ranging from the initial stages of data acquisition and processing—including annotation, filtering, and deduplication—to conducting data ablation experiments and executing large-scale model training. Our approach leverages Snowflake for its robust data governance, lineage tracking, and cloud warehouse capabilities, alongside the versatile CPU and GPU compute resources orchestrated through Kubernetes. This symbiosis not only streamlines the model development process but also enhances efficiency and scalability by optimizing resource allocation and utilization: a cluster of GPU nodes and a Snowflake instance is all you need to do model training from scratch. Through this unified framework, we demonstrate a seamless, end-to-end solution that accelerates LLM training workflows, ensuring both high performance and adherence to data governance standards.

Live remarks will be presented by Jeff Rasley and Lawrence Moore. The post-event video on demand will feature Jeff Rasley and Hyungtae Kim.

Speaker Lawrence Moore, Snowflake

Speaker Jeff Rasley, Snowflake

Speaker Hyungtae Kim, Snowflake

12:20 PM - 12:35 PM

Systems @Scale Live Q&A #2

Speaker Robin Battey, Meta

Speaker Sumit Gupta, Meta

Speaker Lawrence Moore, Snowflake

Speaker Jeff Rasley, Snowflake

Moderator Alex Boyko, Meta

12:35 PM - 01:40 PM

Lunch Break

01:40 PM - 02:00 PM

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

In this presentation, I will discuss the design, implementation, and engineering experience of building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long duration of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to addressing them. We developed a set of diagnostic tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. We share our operational experience in identifying and fixing failures and stragglers.

Speaker Haibin Lin, Bytedance

02:00 PM - 02:15 PM

AI Training Orchestration Evolution with Serverless Building Blocks

Join us as we talk about the evolution of workflow orchestration leading to the creation of composable serverless subsystems. We further discuss how Fblearner, an AI development platform, leveraged this ecosystem of building blocks to address persistent challenges like orchestration-execution coupling, inefficient resource use, and poor debugging experiences. We will also delve into the complexities of updating a business-critical system with strict SLA guarantees at Meta scale.

Speaker Maneet Bansal, Meta

Speaker Upasana Dixit, Meta

Speaker Shawn Wang, Meta

Featured Article

Evolution of AI Training Orchestration with Serverless Ecosystem

02:15 PM - 02:30 PM

Evolving Cluster Management

We will talk about the next evolution of cluster management, specifically focusing on up-leveled paradigms and how they have improved integration with higher-level systems and reduced operational complexity.

Speaker Shankar Selvam, Meta

Speaker Cedric Goh, Meta

Featured Article

Evolving Cluster Management: Upleveling Abstractions

02:30 PM - 02:45 PM

Systems @Scale Live Q&A #3

Speaker Haibin Lin, Bytedance

Speaker Maneet Bansal, Meta

Speaker Upasana Dixit, Meta

Speaker Shawn Wang, Meta

Speaker Shankar Selvam, Meta

Speaker Cedric Goh, Meta

Moderator Maria Barra, Meta

02:45 PM - 03:05 PM

Break

03:05 PM - 03:25 PM

Scalable Solutions for Running Large Language Models

The advent of open-source large language models like Llama and Mixtral demands innovative deployment strategies for efficiency and cost-effectiveness. We will explore adaptive workload management for infrastructure optimization, crucial for handling varying demands efficiently. Next, we will delve into LLM caching techniques, including sticky routing and prompt caching, to enhance response times and optimize system utilization. Additionally, we'll discuss strategies designed to mitigate system pressure during spikes in traffic. These strategies collectively aim to enhance the scalability and efficiency of AI platforms in the era of advanced LLMs.

Speaker Jiaxin Cao, Lepton AI

03:25 PM - 03:45 PM

GenAI Training in Production: Software, Hardware & Network Considerations

The impact of GenAI on infrastructure has been swift and profound across the industry. In this talk, we will outline how Meta built GenAI infrastructure and discuss the challenges and the tradeoffs made across hardware, network, and software to maintain operations at scale. We will also discuss some lessons learned along the way and opportunities that lie ahead.

Speaker Adi Gangidi, Meta

Speaker Jenya Lee, Meta

Speaker KR Kishore, Meta

03:45 PM - 04:05 PM

Systems @Scale Live Q&A #4

Speaker Jiaxin Cao, Lepton AI

Speaker Adi Gangidi, Meta

Speaker Jenya Lee, Meta

Speaker KR Kishore, Meta

Moderator Maria Barra, Meta

04:05 PM - 04:10 PM

Closing Remarks

Moderator Alex Boyko, Meta

SPEAKERS AND MODERATORS

Alex Boyko, Meta
Alex is an engineering leader at Core Systems, where he supports teams responsible for...

Surupa Biswas, Meta
Surupa Biswas is the Vice President of Engineering responsible for Core Infrastructure at Meta,...

Julie Bernauer, NVIDIA
As senior director for Data Center Systems Engineering at NVIDIA, Julie leads a team...

Saranyan Vigraham, Meta
Saranyan Vigraham is a seasoned leader at Meta, where he heads the Data Center...

Benjamin Leonhardi, Meta
Benjamin Leonhardi leads a wide array of problems in the space of AI maintenance...

Robin Battey, Meta
Robin has been designing and building distributed systems since 2003, and for Meta since...

Sumit Gupta, Meta
Sumit has been in the storage industry for 30 years and has been in...

Lawrence Moore, Snowflake
Lawrence is a senior software engineer working on enabling LLMs for the enterprise. He...

Jeff Rasley, Snowflake
Jeff Rasley is a Senior Engineer in the Snowflake AI Research Team working on...

Hyungtae Kim, Snowflake
Hyungtae is a principal software engineer on the Arctic research team at Snowflake. He...

Haibin Lin, Bytedance
Haibin is a research scientist and engineering leader at Bytedance, working on machine learning...

Maneet Bansal, Meta
Maneet Bansal is a seasoned software engineer on the Core Systems Team at Meta,...

Upasana Dixit, Meta
Upasana is a Software Engineer in the Serverless Orchestration team at Meta, where she...

Shawn Wang, Meta
Shawn Wang is a software engineer with years of experience in building large-scale distributed...

Shankar Selvam, Meta
Shankar has been working as a Software Engineer on Meta’s cluster management and container...

Cedric Goh, Meta
Cedric Goh is a Software Engineer in the Core Systems team, working on Twine,...

Maria Barra, Meta
Maria has been a Technical Program Manager within Meta's extensive Infrastructure Organization for over...

Jiaxin Cao, Lepton AI
Machine learning expert. Previously worked for Bytedance & Alibaba. Focus on LLM/Diffusion.

Adi Gangidi, Meta
Adi is a Hardware Systems Engineer at Meta.

Jenya Lee, Meta
Jenya Lee is a Production Engineer at Meta, playing a pivotal role in the...

KR Kishore, Meta
Kishore has been a Hardware Systems Engineer at Meta for the past 4+ years...

LATEST NOTES

Systems & Reliability @Scale

06/10/2024

Evolving Cluster Management: Upleveling Abstractions

At Meta, our vast infrastructure spans over 20 data center regions and comprises millions of machines, all of which work...

Systems & Reliability @Scale

06/11/2024

Evolution of AI Training Orchestration with Serverless Ecosystem

The past couple of years have been nothing short of extraordinary for technology, especially for artificial intelligence (AI). Amidst...
