TOPIC: Machine Learning and AI

AI @Scale 2022

SEPTEMBER 28, 2022 @ 09:00 AM - SEPTEMBER 28, 2022 @ 02:20 PM PT
Designed for engineers interested in solving machine learning scaling problems.


AI @Scale is a technical conference for engineers that are interested in solving scalability problems in machine learning. This is an exciting area that brings many challenges in the nexus of distributed system, hardware design and machine learning technique innovations. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.

AI @Scale will be hosted virtually. Joining us are speakers from AWS, Microsoft, Salesforce, Fiddler, Outerbounds, Unity,, Cruise and Meta. The event will be hosted on September 28, 2022 with talks themed around AI systems implementation at scale.


Event times below are displayed in PT.

September 28

09:00 AM - 09:20 AM
Supporting AI holistically Across the Infra stack

In this talk, we delve into how we have designed Meta’s data centers, network, hardware, storage and software systems to support the needs of existing and emerging AI workloads.

SPEAKER Kaushik Veeraraghavan,Meta
09:20 AM - 09:40 AM
The Laws of ML at Scale

Scaling Machine Learning Models from the laptop to production is non-linear. Different rules and laws apply at scale. Ketan Umare (CEO, TSC, will cover the three laws for ML at scale. These laws will be grounded using anecdotes that the team encountered while working with some of the top companies utilizing ML in production. The talk will cover the following topics,

– Machine Learning products are resource intensive. Oftentimes the cost of ML projects balloon and access to infrastructure is scarce without predefined ROI.

– ML at scale is a team sport. Production teams need a common platform to collaborate efficiently. Small in-efficiencies in collaboration can slow down an entire organization.

If things can go wrong, at scale they will go wrong. We need a safety net - versioning, reproducibility are important in delivering robust ML products at Scale. To deliver these robust products, just code and data is not enough, it is essential to capture the infrastructure & configuration.

SPEAKER Ketan Umare,Union.AI
09:40 AM - 10:00 AM
Human Friendly, Production Ready Data Science Stack

There is a pressing need for tools and workflows that meet data scientists where they are. This is also a serious business need: How to enable an organization of data scientists, who are not software engineers by training, to build and deploy end-to-end machine learning workflows and applications independently? In this talk, we discuss the problem space and the approach we took to solving it at Netflix. We wanted to provide the best possible user experience for data scientists, allowing them to focus on parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure: data, compute, orchestration, and versioning.

SPEAKER Savin Goyal,Outerbounds
10:00 AM - 10:20 AM
ML Infrastructure for Autonomous Vehicles @ Cruise

In this talk I will present the machine learning infrastructure that supports the large scale data preparation and training at Cruise. And how it enables the model improvements in an effective, reliable, continuous and automatic fashion.

SPEAKER Alexander Sidorov,Cruise
10:20 AM - 10:40 AM
OPT-175B: LLM Development Lifecycle & Challenges

This talk will be walking through the development life-cycle of OPT-175B, the first 175B-parameter language model that has been made publicly available for research-use in May 2022. Topics covered will include limitations of transferring results from small-scale experiments, working around hardware failures compounded at scale, numerical instabilities that may occur mid-way through training, amongst others.

SPEAKER Susan Zhang,Meta
10:40 PM - 11:00 PM
Customizable Computer Vision Expands Data Access Without Compromising Privacy

Computer vision has made huge strides recently, helped by large-scale labeled datasets. However, these datasets had no guarantees or analysis diversity. Additionally, privacy concerns may limit the ability to collect more data. These problems are particularly acute in human-centric computer vision for AR/VR applications. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creating synthetic data generators is incredibly challenging and prevents researchers from exploring their usefulness. To promote research into the use of synthetic data, we release a set of data generators for computer vision. We found that pre-training a network using synthetic data and fine-tuning on real-world target data results in models that outperform models trained with the real data alone. Furthermore, we find remarkable gains when limited real-world data is available. These freely available data generators should enable a wide range of research into the emerging field of simulation to real transfer learning for computer vision.

SPEAKER Sujoy Ganguly,Unity
11:00 AM - 11:30 AM
Live Q&A Session

Live Q&A session with all speakers

11:30 AM - 11:40 AM
11:40 AM - 12:00 PM
AI @Scale at Microsoft

Microsoft’s AI @Scale encompasses multiple dimensions. It gives customers access to state-of-the-art large-scale AI models, training and inferencing optimization tools, and supercomputing resources; It provides cross-platform AI runtime that enables customers to infuse AI into every aspect of their platforms and products, across cloud and edge. In this talk, I will give an overview of Microsoft AI @Scale. I will focus on the roles that AI software plays, and discuss the challenges we are facing and the areas for potential collaborations between AI software and hardware.

SPEAKER Yuan Yu,Microsoft
12:00 PM - 12:20 PM
At-scale Training with pyTorch and Amazon SageMaker

This talk will explore how AWS customers use a combination of pyTorch and the managed SageMaker platform to perform efficient and easy-to-use training of large models – and how we built a SageMaker-specific distribution backend.

SPEAKER Andrea Olgiati,AWS
12:20 PM - 12:40 PM
The Future of Inference for Complex AI Applications

Inference of single-model applications has, in recent years, become a multi-stage process, combining online Feature Stores with specialized model-hosting runtimes like ONNX and Triton. Less simplistic AI applications such as those in chatbots, mixed-mode recommenders, and search engines fold "candidate retrieval" steps into the mix. As these applications get more sophisticated, they typically now require all of feature-hydration, semantic encoding/embedding, candidate selection, and candidate re-ranking (not to mention any pre/post-processing or "format massaging" in between any of these steps). While for batch and streaming applications, composable DAG engines and DSLs have been developed to allow these applications to become arbitrarily deep, in the on-demand/realtime world, DAG engines which are open source / open-specification are few and far between, even though they are sorely needed.

SPEAKER Jake Mannix,Salesforce
12:40 PM - 01:00 PM
The New Vision of Compiler-Accelerated PyTorch

What makes PyTorch beloved makes it hard to compile. This is a unique challenge faced by an eager-first ML framework like ours. In this talk, we will demonstrate the feasibility of funneling any PyTorch models through the compilers with minimal user intervention and the exciting possibility of a compiler-accelerated PyTorch. And we will share promising early results on our new solution, TorchDynamo.

SPEAKER Peng Wu,Meta
01:00 PM - 01:20 PM
Model Performance Monitoring - A Practioner's Perspective

Artificial Intelligence is increasingly playing an integral role in determining our day-to-day experiences. Moreover, with the proliferation of AI-based solutions in areas such as hiring, lending, criminal justice, healthcare, and education, the resulting personal and professional implications of AI are far-reaching. The dominant role played by AI models in these domains has led to a growing concern regarding potential performance drift and bias in these models, and demand for model transparency and interpretability. Model Monitoring has become a prerequisite for building trust and adoption of AI systems in high-stakes domains requiring reliability and safety such as healthcare and automated transportation, and critical industrial applications with significant economic implications such as predictive maintenance, exploration of natural resources, and climate change modeling. In this talk, we will be talking about how Fiddler.AI is building Model Performance Management that solves this problem by continuously monitoring AI algorithms for performance and bias issues and reports actionable insights with explanations to the entire organization. For more information, you can visit or follow us on Twitter @fiddlerlabs.

SPEAKER Krishna Gade,Fiddler
01:20 PM - 01:50 PM
Live Q&A Session
01:50 PM - 02:20 PM
Fireside Chat
SPEAKER Yann LeCun,Meta
SPEAKER Ludovic Hauduc,Meta


Kaushik Veeraraghavan obtained his PhD in Computer Science focusing on Distributed Systems from the University of Michigan. He has worked... read more

Kaushik Veeraraghavan


Ketan Umare is the TSC Chair for Flyte (incubating under LF AI & Data). He is also currently the Chief... read more

Ketan Umare


Savin is the co-founder and CTO of Outerbounds - where his team is building the modern ML stack to accelerate... read more

Savin Goyal


Alexander Sidorov is Sr. Staff Software Engineer @ Cruise with over a decade of experience building ML Infrastructure. At Cruise... read more

Alexander Sidorov


Susan Zhang is a Research Engineer at Meta read more

Susan Zhang


Sujoy Ganguly is the Head of Applied Machine Learning Research and Computer Vision at Unity Technologies. Sujoy earned his Ph.D.... read more

Sujoy Ganguly


Yuan Yu is a Technical Fellow at Microsoft. He leads Microsoft’s AI Frameworks team, which is responsible for the development... read more

Yuan Yu


Andrea is a Senior Principal Engineer at Amazon Web Services. He is the Chief Engineer for Amazon SageMaker, a fully-managed... read more

Andrea Olgiati


ML Engineer and system architect who's spent the past couple decades working on scalable recommender systems, search relevance, and building... read more

Jake Mannix


Peng Wu


Krishna Gade is the Founder/CEO of Fiddler.AI, a Model Performance Monitoring startup. Prior to founding Fiddler, Krishna led engineering teams... read more

Krishna Gade


Yann LeCun is VP & Chief AI Scientist at Meta and Silver Professor at NYU affiliated with the Courant Institute... read more

Yann LeCun


Ludo Hauduc supports the AI Infrastructure team at Meta, with the mission to design, build, optimize and scale AI infrastructure... read more

Ludovic Hauduc



To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy