TOPIC: Machine Learning and AI

AI Infra @Scale

MAY 18, 2023 @ 9:00 AM PDT - 11:20 AM PDT

Meta's Engineering and Infrastructure teams are excited to host AI Infra @Scale, a one-day virtual event featuring a range of speakers from Meta who will unveil the latest AI infrastructure investments and innovations powering Meta's products and services. Join us as we share how Meta is creating the next generation of AI infrastructure to build and scale the technologies behind those products and services, today and in the future. We’ll also discuss how these advances in AI will positively impact the broader community. Register today and check back for upcoming speaker and agenda announcements.

RSVPS CLOSED

ABOUT EVENT

AI Infra @Scale 2023 will be hosted virtually and feature a range of speakers from Meta’s engineering and infrastructure teams. Meta's Head of Infrastructure, Santosh Janardhan, will deliver opening and closing remarks and guide us through six technical presentations on some of Meta's latest AI infrastructure investments. Additionally, we’re thrilled to host a fireside chat with a panel of Meta AI infrastructure leaders as they discuss “The Future of AI Infra: The Opportunities and Challenges That Await Us On Our Journey.” The event will take place via webcast on May 18, 2023.

EVENT AGENDA

Event times below are displayed in PT.

May 18

09:00 AM - 09:05 AM
Opening Remarks
Speaker Santosh Janardhan, Meta
ADDITIONAL RESOURCES
Reimagining Our Infrastructure for the AI Age
09:05 AM - 09:25 AM
Meta's Research SuperCluster (RSC): Accelerating AI Research at Scale

Meta is facing a challenging and exciting future as it expands beyond its current capabilities in the social space. Ensuring our platform is open to as many diverse cultures, languages, and perspectives as possible is a significant challenge that requires intensive large-scale AI models. The complexities of adding a virtual reality Metaverse further increase the challenge, requiring much larger models with greater numbers of modalities and parameters.

Meta anticipated these challenges and has built a dedicated, high-performance, state-of-the-art cluster to accelerate AI research. We present the architectural choices that went into building the cluster, which is composed of 16K GPUs, high-performance storage, and a non-blocking InfiniBand network. We will discuss some of the lessons learned and how they have been applied to Meta in general.

Finally, we reflect on the impact the RSC has had on our research projects, and provide some insight into future directions.

Speaker Scott Jeschonek, Meta AI
Speaker Kalyan Saladi, Meta AI
ADDITIONAL RESOURCES
Pursuing groundbreaking scale and accelerating research using Meta’s Research SuperCluster
09:25 AM - 09:40 AM
Next-Generation Data Center Design

Building AI capacity is essential to the future of our company, and supporting AI workloads at scale requires a different approach than scaling to support our regular online services. Our new data center design will support the next generation of AI systems. We are building an increased level of flexibility into our design, which will allow us to pivot in response to shifts and changes in the AI space.

The new design will have fewer but denser racks to support large-scale AI clusters, allowing us to have a smaller footprint while serving the same capacity as our previous data center designs. This design was created with efficiency at the forefront. Going forward, each data center will be optimized for water and energy usage according to its site and region, and will continue to incorporate sustainable features to ensure efficient facilities. We anticipate this design will also be faster and cheaper to build.

Speaker Alan Duong, Meta
09:40 AM - 09:55 AM
PyTorch 2.0

What makes PyTorch beloved also makes it harder to compile. After almost five years, we finally cracked the technologies that make it possible to compile any PyTorch model, resulting in a step-function change in PyTorch’s approach to execution efficiency. We call it PyTorch 2.0.

PyTorch 2.0 delivers significant performance improvements over a wide variety of models, often with just a simple one-liner change. This talk focuses on the two critical technologies underlying PyTorch 2.0, TorchDynamo and TorchInductor.

PyTorch 2.0 was released in March 2023, but do not mistake it for the end of the story. The first release of PyTorch 2.0 marks the beginning of a roadmap for improving PyTorch execution efficiency via compiled mode.

Speaker Peng Wu, Meta
09:55 AM - 10:15 AM
MTIA: Meta's First Generation of AI Accelerators

Meta has traditionally relied on CPU-based servers for running AI workloads, but the increasing compute and memory requirements of these models have pushed the company towards specialized solutions such as GPUs and other hardware accelerators. This talk describes the company's effort in building its first silicon designed for its internal AI workloads and systems. It covers the accelerator architecture, the platform design, and the software stack for enabling and optimizing workloads. It also touches upon the upcoming challenges and evolving requirements that need to be accommodated moving forward.

Speaker Roman Levenstein, Meta
Speaker Amin Firoozshahian, Meta
Speaker Joel Coburn, Meta
Speaker Olivia Wu, Meta
ADDITIONAL RESOURCES
MTIA v1: Meta’s first-generation AI inference accelerator
10:15 AM - 10:30 AM
Break
10:30 AM - 10:40 AM
MSVP: Meta's Scalable Video Processor

This presentation will introduce MSVP (Meta's Scalable Video Processor), the first-generation, server-grade video processing hardware accelerator developed at Meta. We will describe the motivation behind it, the architecture, and some of the novel algorithms used in the video encoder and other video processing blocks to achieve high video quality. We will also describe how the hardware accelerators are used in Meta’s data centers to process and transcode billions of videos every day and provide premium video quality to end users while saving power.

Speaker Harikrishna Reddy, Meta
Speaker Ioannis Katsavounidis, Meta
ADDITIONAL RESOURCES
MSVP: Meta’s first ASIC for video transcoding
10:40 AM - 10:50 AM
Gen AI-Assisted Code Authoring at Meta

At Meta, we have built upon existing research published by FAIR to develop our own AI-assisted code authoring tools. The freedom to experiment with the model, combined with the ability to train on first-party code, has enabled us to deliver tooling that has had a measurable impact on developer productivity.

Speaker Michael Bolin, Meta
10:50 AM - 11:15 AM
Fireside Chat: The Future of AI Infra: The Opportunities and Challenges That Await Us On Our Journey

This panel discussion will focus on the future of AI infrastructure. Moderated by Irina Kofman, head of XAI and responsible for cross-company AI efforts at Meta, the panel features leaders from across Meta's infrastructure organization. They will discuss the challenges and opportunities they see in building world-class, custom infrastructure designed specifically for AI.

Speaker Irina Kofman, Meta AI
Speaker Alexis Björlin, Meta
Speaker Aparna Ramani, Meta
Speaker Kim Hazelwood, Meta AI
Speaker Rachel Peterson, Meta
11:15 AM - 11:20 AM
Closing Remarks
Speaker Santosh Janardhan, Meta

SPEAKERS AND MODERATORS

Santosh Janardhan, Meta
Santosh Janardhan is the head of infrastructure at Meta, where he supports the teams...

Scott Jeschonek, Meta AI
Scott Jeschonek is a Technical Program Manager at Meta, overseeing the Research SuperCluster. Scott...

Kalyan Saladi, Meta AI
Kalyan Saladi joined Meta in 2015. He works on the AI Research Super-Cluster as...

Alan Duong, Meta
Alan Duong is Global Director of Data Centers Engineering team at Meta Platforms, where...

Peng Wu, Meta
Dr. Peng Wu is the engineering manager of the PyTorch Compiler team at Meta....

Roman Levenstein, Meta
Roman Levenstein is leading the development of the compiler and SW stacks for Meta's...

Amin Firoozshahian, Meta
Amin Firoozshahian is a member of the ASIC architecture team, working on architecture definition...

Joel Coburn, Meta
Joel Coburn is a software engineer on the AI System Co-Design team at Meta...

Olivia Wu, Meta
Olivia Wu is a design lead for the AI System Co-Design team at Meta...

Harikrishna Reddy, Meta
Harikrishna Reddy is a Technical Lead in the Infra Silicon Team at Meta, leading...

Ioannis Katsavounidis, Meta
Dr. Ioannis Katsavounidis is part of the Video Infrastructure team, leading technical efforts in...

Michael Bolin, Meta
Michael Bolin is a software engineer who has spent the past decade working in...

Irina Kofman, Meta AI
Irina is the head of XAI, where she is responsible for the cross-company AI...

Alexis Björlin, Meta
Dr. Alexis B. Björlin is Vice President of Infrastructure at Meta, responsible for shaping...

Aparna Ramani, Meta
Aparna Ramani is VP of Engineering at Meta, responsible for Data, Developer and AI...

Kim Hazelwood, Meta AI
Kim Hazelwood is an engineering leader whose expertise lies at the intersection of artificial...

Rachel Peterson, Meta
As Vice President for Data Center Strategy, Rachel Peterson oversees Meta’s global infrastructure expansion...