EVENT AGENDA
Event times below are displayed in PT.
This presentation shares the high-stakes story of Meta's transformation from lagging behind the industry in GPU platform time-to-production (TTP) to becoming an industry leader. We'll begin by explaining the significance of TTP and giving a brief history of our New Product Introductions (NPIs), including AI platforms. A crash course on NPIs will follow, setting the stage for a deeper dive into the key lessons learned from these experiences. Next, we'll walk you through the major changes that drove our dramatic improvement in TTP and conclude with an overview of our current challenges and future work.
The growth of data and need for increased power efficiency are leading to innovative storage solutions. HDDs have been growing in density, but not performance, and TLC flash remains at a price point that is restrictive for scaling. QLC technology addresses these challenges by forming a middle tier between HDDs and TLC SSDs. At Meta we are deploying high density QLC racks at scale, driving innovation to overcome interesting software and hardware challenges, and promoting ecosystem alignment in this evolving storage space.
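To make the middle-tier idea concrete, here is a minimal sketch of temperature-based data placement with QLC sitting between TLC flash and HDD; the thresholds and tier names are illustrative assumptions, not Meta's actual placement policy.

```python
# Toy sketch of temperature-based tier selection, with QLC as the middle tier
# between TLC flash (hot) and HDD (cold). Thresholds are illustrative only.

def choose_tier(reads_per_gb_per_day: float) -> str:
    """Place data by access temperature: hot -> TLC, warm -> QLC, cold -> HDD."""
    if reads_per_gb_per_day > 10:
        return "tlc_ssd"     # performance-critical, worth the higher $/GB
    if reads_per_gb_per_day > 0.5:
        return "qlc_ssd"     # warm data: more IOPS than HDD at lower cost than TLC
    return "hdd"             # cold, capacity-optimized

for temperature in (50.0, 2.0, 0.01):
    print(temperature, "->", choose_tier(temperature))
```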
In this talk, we will discuss the challenges of running ultra-low-latency Large Language Model (LLM) inference at scale. We will cover the unique challenges of LLM inference, such as large model sizes and KV caching. We will also discuss scaling LLM inference to handle large volumes of requests, including hardware needs, efficient scale-up, and new routing architectures. Finally, we will present some of our recent work on addressing these challenges, including our development of inference infrastructure at Union.
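As a rough illustration of the KV-caching concept mentioned above, here is a minimal sketch of a per-layer cache for autoregressive decoding; the class and names are hypothetical and do not reflect the speaker's production inference stack.

```python
# Minimal sketch of per-sequence KV caching for autoregressive decoding.
# Hypothetical names; not the production inference stack described in the talk.
import numpy as np

class KVCache:
    """Stores attention keys/values so each new token only computes its own K/V."""

    def __init__(self, num_layers: int, head_dim: int):
        self.keys = [np.empty((0, head_dim)) for _ in range(num_layers)]
        self.values = [np.empty((0, head_dim)) for _ in range(num_layers)]

    def append(self, layer: int, k: np.ndarray, v: np.ndarray) -> None:
        # Append the K/V vectors produced for the newest token.
        self.keys[layer] = np.vstack([self.keys[layer], k])
        self.values[layer] = np.vstack([self.values[layer], v])

    def attend(self, layer: int, q: np.ndarray) -> np.ndarray:
        # Attention over all cached positions; avoids recomputing past K/V.
        k, v = self.keys[layer], self.values[layer]
        scores = k @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ v

cache = KVCache(num_layers=1, head_dim=4)
for _ in range(3):                       # three decode steps
    cache.append(0, np.random.randn(4), np.random.randn(4))
print(cache.attend(0, np.random.randn(4)).shape)  # (4,)
```

Because past keys and values are reused, each decode step only computes attention inputs for the newest token, which is one reason the cache's memory footprint becomes a central serving constraint.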
The rapid advancement of AI has necessitated a fundamental shift in infrastructure, moving from homogenous workloads that fit within a single server to multi-host workloads requiring tight container coordination across multiple servers. This talk explores the motivations and design principles behind this shift, focusing on the implementation of first-class support for gang scheduling at all layers of the system. We delve into the key components of this design, including Twine and the Resource Allowance System (RAS), and examine how they enable AI serving schemes that employ various forms of parallelism—such as pipeline, context, tensor, and expert parallelism—requiring container shared fate properties and network topology-aware allocation. By addressing these challenges, we aim to provide insights into building scalable and reliable systems that meet the demands of modern AI workloads.
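To illustrate the gang-scheduling and topology-aware allocation ideas described above, here is a minimal sketch of an all-or-nothing placement check; the host/domain model is a deliberately simplified assumption and does not represent Twine or RAS internals.

```python
# Sketch of all-or-nothing (gang) placement with simple topology awareness.
# Hypothetical model of the idea; Twine/RAS internals are not shown here.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    domain: str       # e.g. a rack or network zone
    free_gpus: int

def place_gang(hosts: list[Host], gpus_per_task: int, num_tasks: int) -> list[str] | None:
    """Return hosts for all tasks within one network domain, or None (no partial placement)."""
    by_domain: dict[str, list[Host]] = {}
    for h in hosts:
        by_domain.setdefault(h.domain, []).append(h)
    for domain_hosts in by_domain.values():
        chosen: list[str] = []
        for h in domain_hosts:
            chosen += [h.name] * min(h.free_gpus // gpus_per_task, num_tasks - len(chosen))
            if len(chosen) == num_tasks:
                return chosen            # every task fits in this domain: place the whole gang
    return None                          # otherwise place nothing (shared fate)

hosts = [Host("h1", "rack-a", 8), Host("h2", "rack-a", 8), Host("h3", "rack-b", 8)]
print(place_gang(hosts, gpus_per_task=8, num_tasks=2))  # ['h1', 'h2']
print(place_gang(hosts, gpus_per_task=8, num_tasks=3))  # None: no single domain fits
```

The key property is shared fate: either every task in the gang gets capacity, preferably within one network domain, or nothing is placed at all.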
Both public clouds and our hyperscale private cloud have evolved into complex infrastructures with millions of servers spanning numerous global data center regions. Leaving users to manage the complexity of deploying global services across regions incurs significant operational toil and often leads to suboptimal outcomes. Users must select regions, align global traffic routing with service deployments, and ensure disaster preparedness, all while optimizing cost. They must continually repeat these processes to adapt to workload changes. To eliminate these manual burdens, we introduce the Global Service Placer (GSP), which autonomously places services across regions based on user-defined latency SLOs, abstracting away details of regions from users, such as their geo-distribution and resource availability, while optimizing efficiency. The generality and efficacy of GSP have been demonstrated by onboarding a diverse set of large and complex services. We provide a case study of a highly complex AI inference workload and show significant GPU savings.
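As a toy illustration of SLO-driven placement, the sketch below picks the cheapest region that satisfies a latency SLO; the region names, latencies, and costs are invented for the example and are not Meta data or GSP's actual algorithm.

```python
# Toy sketch of SLO-driven region selection: pick the cheapest region whose
# latency to the traffic source meets the service's latency SLO.
# Region names, latencies, and costs below are illustrative, not Meta data.

REGIONS = {
    # region: (round-trip latency to user traffic in ms, relative cost per replica)
    "region-a": (20, 1.3),
    "region-b": (45, 1.0),
    "region-c": (90, 0.8),
}

def pick_region(latency_slo_ms: float) -> str | None:
    """Among regions meeting the SLO, choose the lowest-cost one."""
    feasible = {r: cost for r, (lat, cost) in REGIONS.items() if lat <= latency_slo_ms}
    if not feasible:
        return None                      # no region can meet the SLO
    return min(feasible, key=feasible.get)

print(pick_region(50))   # region-b: meets the 50 ms SLO and is cheaper than region-a
print(pick_region(15))   # None: nothing is close enough to the users
```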
Meta's recommendation systems rely on "freshness" – the speed at which user interaction signals are ingested, trained on, and utilized. To improve model freshness, Meta developed solutions addressing scaling, serving footprints, and diverse architectures.
We outline Nvidia's experience managing a large-scale internal GPU compute platform spanning multiple heterogeneous clusters. The platform supports thousands of users and hundreds of project accounts, handling a diverse mix of training and batch inference workloads across various research fields. We focus on three key challenges: researcher productivity, resource utilization, and operational efficiency. To improve researcher productivity, we emphasize fair scheduling and workload resilience. To keep resource utilization high, we discuss strategies for sustaining occupancy across these clusters. On the operational efficiency front, we highlight our scheduler simulation capabilities, which enable safe testing of changes without affecting production workloads. The presentation concludes with key lessons learned and our vision for future improvements.
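As a hedged sketch of what scheduler simulation can look like, the example below replays a small recorded job trace against a simple FIFO policy and reports mean queueing delay; the trace format and policy are assumptions for illustration, not Nvidia's simulator.

```python
# Sketch of offline scheduler simulation: replay a recorded job trace against a
# candidate policy and report queueing delay, without touching production.
# The trace format and FIFO policy here are illustrative assumptions.
import heapq

trace = [  # (submit_time, duration, gpus)
    (0, 10, 4), (1, 5, 4), (2, 20, 8), (3, 5, 4),
]

def simulate(total_gpus: int) -> float:
    """FIFO policy: run each job as soon as enough GPUs are free; return mean wait."""
    running: list[tuple[float, int]] = []   # min-heap of (finish_time, gpus)
    free, clock, waits = total_gpus, 0.0, []
    for submit, duration, gpus in trace:
        clock = max(clock, submit)
        while free < gpus:                  # wait for earlier jobs to release GPUs
            finish, released = heapq.heappop(running)
            clock = max(clock, finish)
            free += released
        waits.append(clock - submit)
        heapq.heappush(running, (clock + duration, gpus))
        free -= gpus
    return sum(waits) / len(waits)

print(simulate(total_gpus=8))    # mean wait on the current cluster size
print(simulate(total_gpus=16))   # same trace replayed against a hypothetical larger cluster
```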
This talk will describe our journey with AI hardware reliability (GPU/silicon) while running large-scale training and inference at Meta. It will highlight our efforts across the ecosystem, covering vendor systems as well as our own custom silicon efforts to run AI hardware reliably at scale. For a SW/services audience, this will provide an under-the-hood look into how AI hardware reliability impacts AI applications and how Meta is driving the industry.
In the dynamic realm of streaming services, reliability and scalability are imperatives. This talk unveils the sophisticated architecture of Netflix's Member Discovery System, known as Mosaic, which powers key member-facing pages. Discover the innovative strategies that ensure Mosaic's robustness and reliability, and learn how Netflix sets the standard in delivering a seamless user experience to millions worldwide.
We will explore unconventional testing and deployment strategies that address the vast scale of our user base and diverse client devices, alongside sophisticated failover strategies that safeguard service continuity. Join us to uncover how Netflix maintains its competitive edge in delivering a reliable and scalable discovery system, and be inspired by the cutting-edge techniques that power Mosaic.
The relentless pursuit of more AI is driving us to embrace calculated risks through innovative approaches and to boldly place hardware in “tent”-like structures. How does infrastructure trade off reliability for speed?
Race cars are built for speed and resilience, equipped with cutting-edge features to reach high velocities while maintaining a firm grip on the perilous track. What if we could apply similar features to boost the speed and resilience of AI/ML jobs running over complex networking fabrics?
In this session, we’ll dive into the key networking challenges impacting AI/ML workloads, such as NIC and link flapping, network contention, and congestion. These issues not only slow down job completion times but also eat into ROI by increasing the likelihood of interruptions and costly rollbacks. I’ll demonstrate these challenges in action and introduce a solution that enhances network visibility for AI/ML jobs while ensuring smooth, uninterrupted performance even in the face of link instability or congested paths. By addressing these issues, we can optimize the efficiency of AI/ML jobs, reduce time lost to disruptions, and improve ROI by avoiding the need to revert to previous checkpoints.
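To give a flavor of the kind of network-aware logic discussed here, the sketch below tracks recent link-state changes and steers traffic onto the most stable path; the window, threshold, and path names are illustrative assumptions rather than the actual solution presented in the talk.

```python
# Sketch of flap-aware path selection: track recent link-state changes and steer
# collective traffic onto the most stable path. Thresholds and names are
# illustrative, not the monitoring system described in the talk.
import time
from collections import defaultdict, deque

FLAP_WINDOW_S = 60          # look at state changes over the last minute
FLAP_THRESHOLD = 3          # more changes than this => treat the path as unstable

link_events: dict[str, deque] = defaultdict(deque)

def record_state_change(path: str, now: float | None = None) -> None:
    """Record a link up/down transition observed on a path."""
    now = time.time() if now is None else now
    events = link_events[path]
    events.append(now)
    while events and now - events[0] > FLAP_WINDOW_S:
        events.popleft()                 # drop events outside the window

def pick_path(paths: list[str]) -> str:
    """Prefer the path with the fewest recent flaps (ties broken by name)."""
    return min(paths, key=lambda p: (len(link_events[p]), p))

now = time.time()
for _ in range(4):
    record_state_change("spine-1", now)  # spine-1 is flapping
record_state_change("spine-2", now)
print(pick_path(["spine-1", "spine-2"]))  # spine-2: fewer recent state changes
```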
Join me to learn how to supercharge your AI/ML workloads for speed and resilience — just like a race car engineered for peak performance!
For over a decade, Facebook and most other Meta products have been powered by a single monolithic PHP application. Since 2022, we have been investing in a multitenancy framework to allow product specialization with less operational overhead. This has allowed Meta to move fast in generative AI, by providing a familiar development environment for our engineers with optimizations for inference workloads.
This presentation introduces Pinterest's KVStore, a distributed key-value store designed to support machine learning workloads that are central to Pinterest's functions. KVStore enables efficient low-latency ML feature serving with various data update methods. KVStore is crucial for Pinterest's AI/ML-driven platform, evolving to meet business needs with a focus on reliability and efficiency at scale.
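As a minimal sketch of the serving pattern described above (bulk-loaded batch features combined with real-time updates and low-latency batched reads), consider the following; the interface is hypothetical and not Pinterest's KVStore API.

```python
# Minimal sketch of an ML feature key-value store supporting both bulk uploads
# and real-time updates. Interfaces are hypothetical, not Pinterest's KVStore API.
from typing import Any

class FeatureStore:
    def __init__(self) -> None:
        self._data: dict[str, dict[str, Any]] = {}

    def bulk_load(self, snapshot: dict[str, dict[str, Any]]) -> None:
        # Batch-computed features, e.g. from a daily offline pipeline.
        self._data.update(snapshot)

    def upsert(self, key: str, features: dict[str, Any]) -> None:
        # Real-time update for a single entity (e.g. fresh engagement counters).
        self._data.setdefault(key, {}).update(features)

    def multi_get(self, keys: list[str]) -> list[dict[str, Any]]:
        # Low-latency batched read path used at model-serving time.
        return [self._data.get(k, {}) for k in keys]

store = FeatureStore()
store.bulk_load({"pin:1": {"ctr_7d": 0.042}, "pin:2": {"ctr_7d": 0.017}})
store.upsert("pin:1", {"recent_clicks": 5})
print(store.multi_get(["pin:1", "pin:2", "pin:3"]))
```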
At the beginning of 2023, Instagram had O(10) GPU models, a manual release process, and a manual monitoring setup. This talk centers on our journey to 1000 models: the bumps along the road and the foundational work built to make monitoring model health faster and more accurate. We'll cover the model registry, the model launch process, and model stability.
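As a simplified sketch of automated model-health monitoring, the example below compares a model's live metric against the baseline recorded at launch; the schema and thresholds are illustrative assumptions, not Instagram's registry.

```python
# Sketch of a model registry entry with a simple automated health check: flag a
# model whose live metric drifts too far from its registered baseline.
# Field names and thresholds are illustrative, not Instagram's actual schema.
from dataclasses import dataclass

@dataclass
class RegisteredModel:
    name: str
    version: int
    baseline_metric: float       # e.g. calibration or AUC measured at launch
    tolerance: float = 0.05      # allowed relative drift before alerting

def is_healthy(model: RegisteredModel, live_metric: float) -> bool:
    """Compare a live metric against the launch baseline."""
    drift = abs(live_metric - model.baseline_metric) / model.baseline_metric
    return drift <= model.tolerance

ranker = RegisteredModel("feed_ranker", version=12, baseline_metric=0.80)
print(is_healthy(ranker, 0.79))  # True: within tolerance
print(is_healthy(ranker, 0.70))  # False: alert and investigate or roll back
```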
While our organization excelled at maintaining server SLOs for Google Maps, we discovered that many user-impacting incidents, particularly those stemming from client-side issues like mobile app rollouts, remained undetected by server-centric monitoring. This realization prompted a strategic shift towards product reliability, prioritizing the end-user experience. This talk will discuss how we navigated this transition, sharing our progress in addressing challenges, the valuable lessons learned, and our evolving vision for a holistic, user-focused reliability strategy.