LATEST ON @SCALE

Search below to find the latest videos on @Scale's trending topics and series.

Systems and Networking

Keynote from Microsoft

WATCH VIDEO
Systems and Networking

10x Backbone: Scaling Backbone Connectivity to Serve AI Demands

In this presentation, we will share details about Meta’s Backbone Network, its recent developments, and the journey to support the increasing demands that our existing and new AI workloads place on the network. New technologies and designs to address 10x scaling needs are discussed, as well as how some of these same principles are […]
WATCH VIDEO
Systems and Networking

Track 2 – Live Q&A Session #2

WATCH VIDEO
Systems and Networking

Scaling Llama4 Training to 100K

Llama 4’s pre-training scale is growing exponentially, with 100K GPUs used, a 6x increase from its predecessor. Initializing training takes longer, and failure probability increases with larger scale. Training throughput, also known as Effective Training Time, degrades significantly as a result. To address these challenges, researchers are experimenting in parallel for faster initialization of large-scale jobs, […]
WATCH VIDEO
Systems and Networking

Performance Optimizations at 100K+ Scale

WATCH VIDEO
Systems and Networking

Track 2 – Live Q&A Session #1

WATCH VIDEO
Systems and Networking

PyTorch Symmetric Memory: A New Paradigm for Programming Distributed AI

Recent model advancements have highlighted the need for customized communication. In response, PyTorch introduces Symmetric Memory, a distributed programming model that creates a global address space for data spanning multiple GPUs’ memory. In this talk, we will demonstrate how developers can author their own communication kernels at the device level. Additionally, we will show how […]
WATCH VIDEO
Systems and Networking

Enhancing Runtime Reliability in LLM Training via Fine-Grained Observability

As large language model (LLM) training scales across tens of thousands of GPUs, ensuring runtime reliability becomes both more challenging and more critical for maintaining efficiency. This talk explores how fine-grained observability can substantially enhance reliability in LLM training at scale. First, we discuss automated methods for detecting faulty machines by leveraging distinctive monitoring metric […]
WATCH VIDEO
Systems and Networking

Inference Deployments and Comms Implication

This talk addresses the challenges and solutions for scaling large language model (LLM) inference to support up to 1 billion monthly active users across platforms for Meta AI, focusing on compute-bound prefill and memory-bound decode stages. Key challenges include the quadratic scaling of attention operations with sequence length and the linear growth of the KV […]
WATCH VIDEO
Systems and Networking

Track 1 – Live Q&A Session #2

WATCH VIDEO
Systems and Networking

Architecting Multi-tenant Data-center Networks for GPU Customers

Generative AI is revolutionizing cloud data centers, pushing the limits of what is possible in computing. While the industry already knows how to virtualize regular data-center networks, virtualizing GPU networks in a cloud introduces new challenges. In our talk, we will share Google’s architecture, how we create cutting-edge cloud data centers tailored for […]
WATCH VIDEO
Systems and Networking

Transparent MultiNIC routing for large AI Models

Training large-scale AI models necessitates the transfer of terabits of data per second for various needs, e.g., checkpointing, data ingestion, and hot sparing. However, current network configurations and a lack of application awareness regarding underlying hardware resources often result in suboptimal resource utilization, leading to delayed checkpoint flushes, increased GPU idle time, and failover […]
WATCH VIDEO
