LATEST ON SCALE

Search below to find the latest videos on @Scale's trending topics and series.

Data, Systems and Networking

Network Communication Debuggability and Observability at Scale

Training large models like Llama 3 at scale introduces unprecedented challenges in terms of failures and performance degradation. Existing monitoring techniques fall short in providing the necessary insights for effective troubleshooting. This talk presents our experience in instrumenting the collective library to gain deeper visibility into network operations. We will share techniques for real-time performance tracing, […]
Data, Systems and Networking

Live Q&A Session 3

Data, Systems and Networking

Faster Than Fast: Networking and Communication Optimizations for Llama 3

The network and collective communication stack plays a pivotal role in extracting the best performance from large GenAI clusters. In this talk, we will go in depth on the network and communication library tuning that helped achieve optimal performance for GenAI models such as Llama 3. We’ll touch on both optimizations, from training workload as well as model […]
Data, Systems and Networking

Designing Scalable Networks for Large AI Clusters: Challenges and Key Insights

As AI continues to revolutionize industries, the demand for large-scale AI training clusters is rapidly increasing to meet the growing need for advanced computational capabilities. Fields such as autonomous driving, medical image analysis, language model training, financial modeling, and drug discovery require robust and scalable infrastructure to handle the complexity and scale of AI training […]
Data, Systems and Networking

Scheduler and Sharding Considerations for Network Efficiency

Each generation of large language models requires an order of magnitude larger compute and networking infrastructure. Various model and data parallelism techniques distribute the computational complexity of the model onto infrastructure. Achieving optimal network communication performance requires careful consideration of how models map to network architecture and hierarchy. One of the important layers of the […]
Data, Systems and Networking

Live Q&A Session 2

Data, Systems and Networking

High Network Reliability and Availability in FE and BE for Scalable Training Solutions

Meta has focused on enhancing reliability in Backend (BE) and Frontend (FE) networks for AI training, ensuring low latency and high throughput for GPUs and stable data flow for checkpointing. We’ve implemented a dual monitoring strategy using SLI and evidence-based collections for improved network health analysis and faster issue detection. Stricter controls, on-box agents, and […]
Data, Systems and Networking

Alibaba HPN: A Data Center Network for Large Language Model Training

Due to the differences between LLMs and general cloud computing (e.g., in terms of traffic patterns and fault tolerance), traditional data center networks are not well-suited for LLM training. Unlike general cloud computing, which generates millions of small flows (e.g., lower than 10 Gbps), LLM training produces a small number of periodic, bursty flows (e.g., 400 Gbps) […]
Data, Systems and Networking

Live Q&A Session 1

Data, Systems and Networking

AI Impact on Backbone

Demand for AI capacity from our family of apps and products has accelerated the growth of our large backbone network. Initially, we expected AI-driven traffic to mainly stay within data centers. However, high replication and data freshness requirements, co-location challenges, and cross-region inference needs have increased traffic by 30-50%. To manage this, we’ve deepened our […]
Data, Systems and Networking

Future Challenges for HPC Networking @Scale

Over the last three decades, extreme-scale high-performance computing (HPC) systems have evolved from highly specialized hardware running custom software environments to platforms that are almost entirely composed of commodity components. While some aspects of large-scale HPC systems continue to be enhanced to meet performance and scalability demands, HPC systems have been distinguished by interconnect technology. The […]
