EVENT AGENDA
Event times below are displayed in PT.
Systems @Scale 2024 is a technical conference intended for engineers who build and manage large-scale distributed systems serving millions or billions of users. The development and operation of such systems often introduce complex, unprecedented engineering challenges. The @Scale community focuses on bringing together engineers to share their experiences creating innovative solutions to these challenges.
On June 12, Systems @Scale will bring together speakers from ByteDance, Lepton AI, Meta, NVIDIA, and Snowflake to discuss infrastructure support for AI, system efficiency and reliability, distributed system abstractions, hyperscale orchestration, and more.
In-person registration is now closed. Registration to join virtually is still open and will remain open through event day.
The AI revolution has created an exciting period of innovation for infrastructure engineers. It is a time when new methodologies and system architectures are taking shape. In this keynote, Surupa Biswas, VP of Engineering at Meta, covers Meta's in-progress journey of evolving its core infrastructure systems at an unprecedented pace in support of AI.
As language models grow larger, compute infrastructure must deliver both reliability and performance at unprecedented scales. Beyond bringing a large number of GPUs together, the platform must not only provide guarantees on fabric and I/O performance and stability, but also ensure the software is architected for consistency and reliability across workload launching, job scheduling, and monitoring. In this talk, we will describe how Eos was built around an H100 reference cluster architecture.
In just two years, Meta has undergone a monumental transformation in its AI infrastructure, transitioning from a single research cluster to a sprawling network of nearly a hundred AI superclusters of varying sizes with hundreds of thousands of GPUs. This rapid expansion has introduced a myriad of challenges, ranging from managing diverse hardware configurations to optimizing resource allocation. As part of this, we scaled our infrastructure to safely and predictably perform maintenance without disrupting training jobs. For example, our teams worked with teams at NVIDIA to advance the GPU stack with deep health checks, allowing us to roll out critical upgrades continuously.
During this talk, we will share insights into the key areas that demanded our attention and the solutions we implemented to address them. From implementing rolling updates for kernel drivers and firmware to leveraging redundancies and failover mechanisms, we will explore the technical intricacies involved in sustaining our AI infrastructure while conducting essential maintenance tasks. Furthermore, we will discuss the role of automation and orchestration in streamlining maintenance operations, minimizing downtime, and optimizing resource utilization.
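The drain-upgrade-verify pattern described above can be sketched in a few lines. The sketch below is purely illustrative and assumes hypothetical fleet-management hooks (drain_host, apply_update, run_deep_health_checks, and so on) rather than Meta's or NVIDIA's actual tooling; the point is only the shape of the loop: drain a host, upgrade it, re-verify its health, and only then return capacity to the scheduler.

```python
# Illustrative sketch of a rolling maintenance loop for a GPU fleet.
# All function hooks here (drain_host, apply_update, run_deep_health_checks,
# return_to_scheduler, quarantine_host) are hypothetical placeholders.
from typing import Callable, Iterable

def rolling_maintenance(hosts: Iterable[str],
                        drain_host: Callable[[str], None],
                        apply_update: Callable[[str], None],
                        run_deep_health_checks: Callable[[str], bool],
                        return_to_scheduler: Callable[[str], None],
                        quarantine_host: Callable[[str], None]) -> None:
    """Upgrade hosts one at a time so running training jobs are never disrupted."""
    for host in hosts:
        drain_host(host)                  # checkpoint / migrate work off the host
        apply_update(host)                # e.g., kernel driver or GPU firmware
        if run_deep_health_checks(host):  # GPU, fabric, and I/O checks
            return_to_scheduler(host)     # host rejoins the training pool
        else:
            quarantine_host(host)         # keep suspect hardware out of jobs
```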
GenAI training needs flipped the script on all of our assumptions about "storage at scale". This is the story of our trials and tribulations that ultimately led to the successful launch of our largest-scale LLaMA training jobs, from a storage perspective.
In this case study, we present the system used to train the Arctic MoE model at Snowflake. The system uses a combination of Snowflake and Kubernetes for the entire lifecycle of Large Language Model (LLM) training, ranging from the initial stages of data acquisition and processing—including annotation, filtering, and deduplication—to conducting data ablation experiments and executing large-scale model training. Our approach leverages Snowflake for its robust data governance, lineage tracking, and cloud warehouse capabilities, alongside the versatile CPU and GPU compute resources orchestrated through Kubernetes. This symbiosis not only streamlines the model development process but also enhances efficiency and scalability by optimizing resource allocation and utilization: a cluster of GPU nodes and a Snowflake instance is all you need to do model training from scratch. Through this unified framework, we demonstrate a seamless, end-to-end solution that accelerates LLM training workflows, ensuring both high performance and adherence to data governance standards.
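As a rough sketch of how such a Snowflake-plus-Kubernetes workflow could be wired together (the table, stage, image, and namespace names below are invented for illustration and are not the actual Arctic pipeline), one could materialize a deduplicated data shard with the Snowflake Python connector and then submit a GPU training Job via the Kubernetes client:

```python
# Illustrative only: export a deduplicated training shard from Snowflake,
# then launch a GPU training Job on Kubernetes. All names are placeholders.
import snowflake.connector
from kubernetes import client, config

def export_dedup_shard(conn_params: dict, shard_id: int) -> None:
    """Unload one deduplicated shard into a stage the trainers can read."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        conn.cursor().execute(
            f"COPY INTO @training_stage/shard_{shard_id} "
            f"FROM (SELECT text FROM corpus_dedup WHERE shard_id = {shard_id})"
        )
    finally:
        conn.close()

def launch_training_job(num_gpus: int) -> None:
    """Submit a single-container training Job; real orchestration is richer."""
    config.load_kube_config()
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="llm-train"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="trainer",
                        image="registry.example.com/llm-trainer:latest",
                        command=["python", "train.py", "--data", "@training_stage"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": str(num_gpus)},
                        ),
                    )],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="training", body=job)
```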
Live remarks will be presented by Jeff Rasley and Lawrence Moore. The post-event video on demand will feature Jeff Rasley and Hyungtae Kim.
In this presentation, I will discuss the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We developed a set of diagnostic tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. We share our operational experience in identifying and fixing failures and stragglers.
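One of the simplest signals behind straggler mitigation is comparing per-rank step times against the rest of the fleet. The snippet below is a toy heuristic for illustration only, not MegaScale's diagnostic tooling:

```python
# Toy straggler detector: given per-rank step durations collected from a
# training job, flag ranks whose recent steps are much slower than the
# fleet median. An illustrative heuristic, not MegaScale's implementation.
from statistics import median
from typing import Dict, List

def find_stragglers(step_times: Dict[int, List[float]],
                    slowdown_factor: float = 1.3) -> List[int]:
    """Return ranks whose mean recent step time exceeds the cross-rank
    median by more than `slowdown_factor`."""
    per_rank_mean = {rank: sum(t) / len(t) for rank, t in step_times.items() if t}
    fleet_median = median(per_rank_mean.values())
    return sorted(rank for rank, mean_t in per_rank_mean.items()
                  if mean_t > slowdown_factor * fleet_median)

# Example: rank 2 is ~40% slower than its peers and gets flagged.
print(find_stragglers({0: [1.0, 1.1], 1: [1.0, 1.0], 2: [1.4, 1.5]}))  # -> [2]
```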
Join us as we talk about the evolution of workflow orchestration that led to the creation of composable serverless subsystems. We further discuss how FBLearner, an AI development platform, leveraged this ecosystem of building blocks to address persistent challenges like orchestration-execution coupling, inefficient resource use, and poor debugging experiences. We will also delve into the complexities of updating a business-critical system with strict SLA guarantees at Meta scale.
We will talk about the next evolution of cluster management, specifically focusing on up-leveled paradigms and how they have improved integration with higher-level systems and reduced operational complexity.
The advent of open-source large language models like Llama and Mixtral demands innovative deployment strategies for efficiency and cost-effectiveness. We will explore adaptive workload management for infrastructure optimization, crucial for handling varying demands efficiently. Next, we will delve into LLM caching techniques, including sticky routing and prompt caching, to enhance response times and optimize system utilization. Additionally, we'll discuss strategies designed to mitigate system pressure during spikes in traffic. These strategies collectively aim to enhance the scalability and efficiency of AI platforms in the era of advanced LLMs.
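To make the sticky-routing idea concrete: if requests that share a prompt prefix are routed to the same replica, that replica's prompt (prefix) cache stays warm and prefill work is reused. The sketch below is a hypothetical illustration, with a character-level prefix standing in for a token prefix and an arbitrary replica list; it is not any specific platform's router.

```python
# Minimal sketch of sticky routing for prompt caching: requests sharing a
# prompt prefix hash to the same replica, so its cached prefill state can
# be reused. Hypothetical illustration, not a production router.
import hashlib
from typing import List

def pick_replica(prompt: str, replicas: List[str], prefix_chars: int = 128) -> str:
    """Hash a fixed-length prompt prefix and map it onto one replica."""
    prefix = prompt[:prefix_chars]  # crude stand-in for a token-level prefix
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["replica-0", "replica-1", "replica-2"]
system_prompt = "You are a helpful assistant. " * 10  # shared prefix > 128 chars

# Two requests sharing the long system prompt land on the same replica,
# so that replica's prompt cache stays warm for the shared prefix.
assert pick_replica(system_prompt + "Question one", replicas) == \
       pick_replica(system_prompt + "Question two", replicas)
```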
The impact of GenAI on infrastructure has been swift and profound across the industry. In this talk we will outline how Meta built its GenAI infrastructure and discuss the challenges and tradeoffs made across hardware, network, and software while maintaining operations at scale. We will also discuss some lessons learned along the way and the opportunities that lie ahead.
Alex is an engineering leader at Core Systems, where he supports teams responsible for...
Surupa Biswas is the Vice President of Engineering responsible for Core Systems at Meta....
As senior director for Data Center Systems Engineering at NVIDIA, Julie leads a team...
Saranyan Vigraham is a seasoned leader at Meta, where he heads the Data Center...
Benjamin Leonhardi leads a wide array of problems in the space of AI maintenance...
Robin has been designing and building distributed systems since 2003, and for Meta since...
Sumit has been in the storage industry for 30 years and has worked at...
Lawrence is a senior software engineer working on enabling LLMs for the enterprise. He...
Jeff Rasley is a Senior Engineer in the Snowflake AI Research Team working on...
Hyungtae is a principal software engineer on the Arctic research team at Snowflake. He...
Haibin is a research scientist and engineering leader at ByteDance, working on machine learning...
Maneet Bansal is a seasoned software engineer on the Core Systems Team at Meta,...
Upasana is a Software Engineer in the Serverless Orchestration team at Meta, where she...
Shawn Wang is a software engineer with years of experience in building large-scale distributed...
Shankar has been working as a Software Engineer on Meta’s cluster management and container...
Cedric Goh is a Software Engineer in the Core Systems team, working on Twine,...
Maria has been a Technical Program Manager within Meta's extensive Infrastructure Organization for over...
Machine learning expert. Previously worked for ByteDance & Alibaba. Focus on LLM/Diffusion.
Adi is a Hardware Systems Engineer at Meta.
Jenya Lee is a Production Engineer at Meta, playing a pivotal role in the...
Kishore has been a Hardware Systems Engineer at Meta for the past 4+ years...