EVENT AGENDA
Event times below are displayed in PT.
This presentation shares the high-stakes story of Meta's transformation from lagging behind the industry in GPU platform time-to-production (TTP) to becoming an industry leader. We'll begin by explaining the significance of TTP and giving a brief history of our New Product Introductions (NPIs), including AI platforms. A crash course on NPIs will follow, setting the stage for a deeper dive into the key lessons learned from these experiences. Next, we'll walk you through the major changes that drove our dramatic improvement in TTP and conclude with an overview of our current challenges and future work.
The growth of data and need for increased power efficiency are leading to innovative storage solutions. HDDs have been growing in density, but not performance, and TLC flash remains at a price point that is restrictive for scaling. QLC technology addresses these challenges by forming a middle tier between HDDs and TLC SSDs. At Meta we are deploying high density QLC racks at scale, driving innovation to overcome interesting software and hardware challenges, and promoting ecosystem alignment in this evolving storage space.
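To make the middle-tier idea concrete, here is a minimal sketch of temperature-based data placement with QLC sitting between TLC flash and HDD; the thresholds and tier names are illustrative assumptions, not Meta's actual placement policy.

```python
# Toy sketch of temperature-based tier selection, with QLC as the middle tier
# between TLC flash (hot) and HDD (cold). Thresholds are illustrative only.

def choose_tier(reads_per_gb_per_day: float) -> str:
    """Place data by access temperature: hot -> TLC, warm -> QLC, cold -> HDD."""
    if reads_per_gb_per_day > 10:
        return "tlc_ssd"     # performance-critical, worth the higher $/GB
    if reads_per_gb_per_day > 0.5:
        return "qlc_ssd"     # warm data: more IOPS than HDD at lower cost than TLC
    return "hdd"             # cold, capacity-optimized

for temperature in (50.0, 2.0, 0.01):
    print(temperature, "->", choose_tier(temperature))
```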
In this talk, we will discuss the challenges of running ultra-low-latency Large Language Model (LLM) inference at scale. We will cover the unique challenges of LLM inference, such as large model sizes and KV caching. We will also discuss scaling LLM inference to handle large volumes of requests, including hardware needs, efficient scale-up, and new routing architectures. Finally, we will present some of our recent work on addressing these challenges, including our development of inference infrastructure at Union.
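As a rough illustration of the KV-caching concept mentioned above, here is a minimal sketch of a per-layer cache for autoregressive decoding; the class and names are hypothetical and do not reflect the speaker's production inference stack.

```python
# Minimal sketch of per-sequence KV caching for autoregressive decoding.
# Hypothetical names; not the production inference stack described in the talk.
import numpy as np

class KVCache:
    """Stores attention keys/values so each new token only computes its own K/V."""

    def __init__(self, num_layers: int, head_dim: int):
        self.keys = [np.empty((0, head_dim)) for _ in range(num_layers)]
        self.values = [np.empty((0, head_dim)) for _ in range(num_layers)]

    def append(self, layer: int, k: np.ndarray, v: np.ndarray) -> None:
        # Append the K/V vectors produced for the newest token.
        self.keys[layer] = np.vstack([self.keys[layer], k])
        self.values[layer] = np.vstack([self.values[layer], v])

    def attend(self, layer: int, q: np.ndarray) -> np.ndarray:
        # Attention over all cached positions; avoids recomputing past K/V.
        k, v = self.keys[layer], self.values[layer]
        scores = k @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ v

cache = KVCache(num_layers=1, head_dim=4)
for _ in range(3):                       # three decode steps
    cache.append(0, np.random.randn(4), np.random.randn(4))
print(cache.attend(0, np.random.randn(4)).shape)  # (4,)
```

Because past keys and values are reused, each decode step only computes attention inputs for the newest token, which is one reason the cache's memory footprint becomes a central serving constraint.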
The rapid advancement of AI has necessitated a fundamental shift in infrastructure, moving from homogenous workloads that fit within a single server to multi-host workloads requiring tight container coordination across multiple servers. This talk explores the motivations and design principles behind this shift, focusing on the implementation of first-class support for gang scheduling at all layers of the system. We delve into the key components of this design, including Twine and the Resource Allowance System (RAS), and examine how they enable AI serving schemes that employ various forms of parallelism—such as pipeline, context, tensor, and expert parallelism—requiring container shared fate properties and network topology-aware allocation. By addressing these challenges, we aim to provide insights into building scalable and reliable systems that meet the demands of modern AI workloads.
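To illustrate the gang-scheduling and topology-aware allocation ideas described above, here is a minimal sketch of an all-or-nothing placement check; the host/domain model is a deliberately simplified assumption and does not represent Twine or RAS internals.

```python
# Sketch of all-or-nothing (gang) placement with simple topology awareness.
# Hypothetical model of the idea; Twine/RAS internals are not shown here.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    domain: str       # e.g. a rack or network zone
    free_gpus: int

def place_gang(hosts: list[Host], gpus_per_task: int, num_tasks: int) -> list[str] | None:
    """Return hosts for all tasks within one network domain, or None (no partial placement)."""
    by_domain: dict[str, list[Host]] = {}
    for h in hosts:
        by_domain.setdefault(h.domain, []).append(h)
    for domain_hosts in by_domain.values():
        chosen: list[str] = []
        for h in domain_hosts:
            chosen += [h.name] * min(h.free_gpus // gpus_per_task, num_tasks - len(chosen))
            if len(chosen) == num_tasks:
                return chosen            # every task fits in this domain: place the whole gang
    return None                          # otherwise place nothing (shared fate)

hosts = [Host("h1", "rack-a", 8), Host("h2", "rack-a", 8), Host("h3", "rack-b", 8)]
print(place_gang(hosts, gpus_per_task=8, num_tasks=2))  # ['h1', 'h2']
print(place_gang(hosts, gpus_per_task=8, num_tasks=3))  # None: no single domain fits
```

The key property is shared fate: either every task in the gang gets capacity, preferably within one network domain, or nothing is placed at all.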
Both public clouds and our hyperscale private cloud have evolved into complex infrastructures with millions of servers spanning numerous global data center regions. Leaving users to manage the complexity of deploying global services across regions incurs significant operational toil and often leads to suboptimal outcomes. Users must select regions, align global traffic routing with service deployments, and ensure disaster preparedness, all while optimizing cost. They must continually repeat these processes to adapt to workload changes. To eliminate these manual burdens, we introduce the Global Service Placer (GSP), which autonomously places services across regions based on user-defined latency SLOs, abstracting away details of regions from users, such as their geo-distribution and resource availability, while optimizing efficiency. The generality and efficacy of GSP have been demonstrated by onboarding a diverse set of large and complex services. We provide a case study of a highly complex AI inference workload and show significant GPU savings.
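As a toy illustration of SLO-driven placement, the sketch below picks the cheapest region that satisfies a latency SLO; the region names, latencies, and costs are invented for the example and are not Meta data or GSP's actual algorithm.

```python
# Toy sketch of SLO-driven region selection: pick the cheapest region whose
# latency to the traffic source meets the service's latency SLO.
# Region names, latencies, and costs below are illustrative, not Meta data.

REGIONS = {
    # region: (round-trip latency to user traffic in ms, relative cost per replica)
    "region-a": (20, 1.3),
    "region-b": (45, 1.0),
    "region-c": (90, 0.8),
}

def pick_region(latency_slo_ms: float) -> str | None:
    """Among regions meeting the SLO, choose the lowest-cost one."""
    feasible = {r: cost for r, (lat, cost) in REGIONS.items() if lat <= latency_slo_ms}
    if not feasible:
        return None                      # no region can meet the SLO
    return min(feasible, key=feasible.get)

print(pick_region(50))   # region-b: meets the 50 ms SLO and is cheaper than region-a
print(pick_region(15))   # None: nothing is close enough to the users
```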
Meta's recommendation systems rely on "freshness" – the speed at which user interaction signals are ingested, trained on, and utilized. To improve model freshness, Meta developed solutions addressing scaling, serving footprints, and diverse architectures.
We outline Nvidia's experience managing a large-scale internal GPU compute platform spanning multiple heterogeneous clusters. The platform supports thousands of users and hundreds of project accounts, handling a diverse mix of training and batch inference workloads across various research fields. We focus on three key challenges: researcher productivity, resource utilization, and operational efficiency. To improve researcher productivity, we emphasize fair scheduling and workload resilience. To keep resource utilization high, we discuss strategies for sustaining occupancy across these clusters. On the operational efficiency front, we highlight our scheduler simulation capabilities, which enable safe testing of changes without affecting production workloads. The presentation concludes with key lessons learned and our vision for future improvements.
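As a hedged sketch of what scheduler simulation can look like, the example below replays a small recorded job trace against a simple FIFO policy and reports mean queueing delay; the trace format and policy are assumptions for illustration, not Nvidia's simulator.

```python
# Sketch of offline scheduler simulation: replay a recorded job trace against a
# candidate policy and report queueing delay, without touching production.
# The trace format and FIFO policy here are illustrative assumptions.
import heapq

trace = [  # (submit_time, duration, gpus)
    (0, 10, 4), (1, 5, 4), (2, 20, 8), (3, 5, 4),
]

def simulate(total_gpus: int) -> float:
    """FIFO policy: run each job as soon as enough GPUs are free; return mean wait."""
    running: list[tuple[float, int]] = []   # min-heap of (finish_time, gpus)
    free, clock, waits = total_gpus, 0.0, []
    for submit, duration, gpus in trace:
        clock = max(clock, submit)
        while free < gpus:                  # wait for earlier jobs to release GPUs
            finish, released = heapq.heappop(running)
            clock = max(clock, finish)
            free += released
        waits.append(clock - submit)
        heapq.heappush(running, (clock + duration, gpus))
        free -= gpus
    return sum(waits) / len(waits)

print(simulate(total_gpus=8))    # mean wait on the current cluster size
print(simulate(total_gpus=16))   # same trace replayed against a hypothetical larger cluster
```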
This talk will describe our journey with AI hardware reliability (GPU/silicon) while running large-scale training and inference at Meta. It will highlight our efforts across the ecosystem, covering vendor systems as well as our own custom silicon efforts to run AI hardware reliably at scale. For a SW/services audience, this will provide an under-the-hood look into how AI hardware reliability impacts AI applications and how Meta is driving the industry.
In the dynamic realm of streaming services, reliability and scalability are imperatives. This talk unveils the sophisticated architecture of Netflix's Member Discovery System, known as Mosaic, which powers key member-facing pages. Discover the innovative strategies that ensure Mosaic's robustness and reliability, and learn how Netflix sets the standard in delivering a seamless user experience to millions worldwide.
We will explore unconventional testing and deployment strategies that address the vast scale of our user base and diverse client devices, alongside sophisticated failover strategies that safeguard service continuity. Join us to uncover how Netflix maintains its competitive edge in delivering a reliable and scalable discovery system, and be inspired by the cutting-edge techniques that power Mosaic.
The relentless pursuit of more AI is driving us to embrace calculated risks through innovative approaches and to boldly place hardware in “tent”-like structures. How does infrastructure trade off reliability for speed?
Race cars are built for speed and resilience, equipped with cutting-edge features to reach high velocities while maintaining a firm grip on the perilous track. What if we could apply similar features to boost the speed and resilience of AI/ML jobs running over complex networking fabrics?
In this session, we’ll dive into the key networking challenges impacting AI/ML workloads, such as NIC and link flapping, network contention, and congestion. These issues not only slow down job completion times but also eat into ROI by increasing the likelihood of interruptions and costly rollbacks. I’ll demonstrate these challenges in action and introduce a solution that enhances network visibility for AI/ML jobs while ensuring smooth, uninterrupted performance even in the face of link instability or congested paths. By addressing these issues, we can optimize the efficiency of AI/ML jobs, reduce time lost to disruptions, and improve ROI by avoiding the need to revert to previous checkpoints.
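To give a flavor of the kind of network-aware logic discussed here, the sketch below tracks recent link-state changes and steers traffic onto the most stable path; the window, threshold, and path names are illustrative assumptions rather than the actual solution presented in the talk.

```python
# Sketch of flap-aware path selection: track recent link-state changes and steer
# collective traffic onto the most stable path. Thresholds and names are
# illustrative, not the monitoring system described in the talk.
import time
from collections import defaultdict, deque

FLAP_WINDOW_S = 60          # look at state changes over the last minute
FLAP_THRESHOLD = 3          # more changes than this => treat the path as unstable

link_events: dict[str, deque] = defaultdict(deque)

def record_state_change(path: str, now: float | None = None) -> None:
    """Record a link up/down transition observed on a path."""
    now = time.time() if now is None else now
    events = link_events[path]
    events.append(now)
    while events and now - events[0] > FLAP_WINDOW_S:
        events.popleft()                 # drop events outside the window

def pick_path(paths: list[str]) -> str:
    """Prefer the path with the fewest recent flaps (ties broken by name)."""
    return min(paths, key=lambda p: (len(link_events[p]), p))

now = time.time()
for _ in range(4):
    record_state_change("spine-1", now)  # spine-1 is flapping
record_state_change("spine-2", now)
print(pick_path(["spine-1", "spine-2"]))  # spine-2: fewer recent state changes
```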
Join me to learn how to supercharge your AI/ML workloads for speed and resilience — just like a race car engineered for peak performance!
For over a decade, Facebook and most other Meta products have been powered by a single monolithic PHP application. Since 2022, we have been investing in a multitenancy framework to allow product specialization with less operational overhead. This has allowed Meta to move fast in generative AI, by providing a familiar development environment for our engineers with optimizations for inference workloads.
This presentation introduces Pinterest's KVStore, a distributed key-value store designed to support machine learning workloads that are central to Pinterest's functions. KVStore enables efficient low-latency ML feature serving with various data update methods. KVStore is crucial for Pinterest's AI/ML-driven platform, evolving to meet business needs with a focus on reliability and efficiency at scale.
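As a minimal sketch of the serving pattern described above (bulk-loaded batch features combined with real-time updates and low-latency batched reads), consider the following; the interface is hypothetical and not Pinterest's KVStore API.

```python
# Minimal sketch of an ML feature key-value store supporting both bulk uploads
# and real-time updates. Interfaces are hypothetical, not Pinterest's KVStore API.
from typing import Any

class FeatureStore:
    def __init__(self) -> None:
        self._data: dict[str, dict[str, Any]] = {}

    def bulk_load(self, snapshot: dict[str, dict[str, Any]]) -> None:
        # Batch-computed features, e.g. from a daily offline pipeline.
        self._data.update(snapshot)

    def upsert(self, key: str, features: dict[str, Any]) -> None:
        # Real-time update for a single entity (e.g. fresh engagement counters).
        self._data.setdefault(key, {}).update(features)

    def multi_get(self, keys: list[str]) -> list[dict[str, Any]]:
        # Low-latency batched read path used at model-serving time.
        return [self._data.get(k, {}) for k in keys]

store = FeatureStore()
store.bulk_load({"pin:1": {"ctr_7d": 0.042}, "pin:2": {"ctr_7d": 0.017}})
store.upsert("pin:1", {"recent_clicks": 5})
print(store.multi_get(["pin:1", "pin:2", "pin:3"]))
```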
At the beginning of 2023, Instagram had O(10) GPU models, a manual release process, and a manual monitoring setup. This talk centers on our journey to 1000 models: the bumps along the road and the foundational work built to make monitoring model health faster and more accurate. We'll cover the model registry, the model launch process, and model stability.
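As a simplified sketch of automated model-health monitoring, the example below compares a model's live metric against the baseline recorded at launch; the schema and thresholds are illustrative assumptions, not Instagram's registry.

```python
# Sketch of a model registry entry with a simple automated health check: flag a
# model whose live metric drifts too far from its registered baseline.
# Field names and thresholds are illustrative, not Instagram's actual schema.
from dataclasses import dataclass

@dataclass
class RegisteredModel:
    name: str
    version: int
    baseline_metric: float       # e.g. calibration or AUC measured at launch
    tolerance: float = 0.05      # allowed relative drift before alerting

def is_healthy(model: RegisteredModel, live_metric: float) -> bool:
    """Compare a live metric against the launch baseline."""
    drift = abs(live_metric - model.baseline_metric) / model.baseline_metric
    return drift <= model.tolerance

ranker = RegisteredModel("feed_ranker", version=12, baseline_metric=0.80)
print(is_healthy(ranker, 0.79))  # True: within tolerance
print(is_healthy(ranker, 0.70))  # False: alert and investigate or roll back
```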
While our organization excelled at maintaining server SLOs for Google Maps, we discovered that many user-impacting incidents, particularly those stemming from client-side issues like mobile app rollouts, remained undetected by server-centric monitoring. This realization prompted a strategic shift towards product reliability, prioritizing the end-user experience. This talk will discuss how we navigated this transition, sharing our progress in addressing challenges, the valuable lessons learned, and our evolving vision for a holistic, user-focused reliability strategy.