The Configerator repository provides Meta developers with a way to make changes easily and quickly to production services. By default, it pushes changes to all services at Meta in a matter of seconds, and doesn’t have the traditional ...
Rockset provides low-latency SQL access to schemaless data that is ingested in real-time. Immediate access to dynamically structured data is very powerful, enabling rapid development and iteration for products built on top, but it ...
When you think about Meta’s family of apps, what comes to mind? Maybe the over 6 thousand photos and videos created per second on Instagram, the 5 trillion photos on Facebook, or the 60 million group posts loaded each second. It’s ...
Internet services like Facebook, Instagram, and WhatsApp rely on large-scale infrastructure to support various compute, storage, and AI workloads. With the support of data and ML techniques, we can scale our infrastructure ...
Alibaba Cloud offers a comprehensive suite of elastic computing services that are based on container technology. Alibaba Group is one of the key customers of Alibaba Cloud and all of the major applications across its large and diverse ...
This talk introduces a new user experience-focused reliability measurement that exposes end-to-end reliability guarantees across the vertical service stack used by Meta’s family of Apps. The talk discusses the difference between the ...
While the cry of “breaking apart the monolith” can be heard throughout the industry, the Istio service mesh took a different tack, and consolidated its control plane microservices into one binary. How did we get here? In ...
Cache invalidation is considered one of the hardest things in Computer Science. We, at Meta, operate some of the world’s largest cache deployments (e.g. Memcache and TAO), serving more than one quadrillion queries a day. We have ...
Apache Kafka has emerged as the de facto standard event streaming platform in enterprise architectures. Many business applications are moving away from data-at-rest to an event-driven architecture so that they can leverage the ...
In this presentation we will introduce Zelos. Zelos provides the exact same semantics as ZooKeeper but is built using Delos. ZooKeeper forms the foundation of Meta’s infrastructure stack, and we have been using it for over a decade. ...
Hosting managed relational database services in the cloud with the availability and reliability guarantees demanded by mission-critical workloads, and doing it at scale, presents a set of interesting challenges. This talk will walk ...
Ime Archibong, head of New Product Experimentation (NPE) at Meta and a 12-year Meta veteran, will talk about 0-1 innovation at scale. He’ll discuss the value of experimentation as an approach, and demystify how real breakthroughs happen.
There are many tools that Creators can use to build novel AR experiences. However, not many of these tools can deliver a wide-ranging set of capabilities and creative assets to billions of devices with both quality and speed. In this ...
NVIDIA Map aggregates data from millions of NVIDIA DRIVE Hyperion consumer and survey data-collection vehicles for safe, reliable, and up-to-date global high-def map coverage. The platform supports automated driving functionality from ...
As a mobile app grows in users, features, and contributing engineers, there are often tradeoffs between the performance of the app and the velocity of feature growth. Messenger’s users have high performance expectations, so ...
In this talk, I’ll provide an overview of how we use ML algorithms to detect policy-violative content on YouTube across all entity types (videos, comments, livestreams, engagements, etc.) and keep our community safe. ML ...
An overview of how WhatsApp thinks of privacy in the messaging world. We’ll go through some of our general principles, concrete product use cases, and challenges that come with privacy at scale.
Trust & Safety at Twitch is uniquely challenging, as the vast majority of content and chat interactions unfold in real time, across a wide variety of communities with different needs, cultures, and audiences. Mitigating and ...
Building successful products is hard. Building successful products at scale? Ridiculously hard! It takes strong vision, deep dedication, consistent execution, with a healthy sprinkle of unorthodox methods. This talk shares a few ...
We’ll talk about two 0-1 products in the creator space: Subscriptions and Music Releases on IG. We’ll walk through how we built the early stages of these as visually stunning products, as well as the unique set of challenges our teams ...
In 2007, Facebook released their first iOS App. It was written in HTML, and it was supported by a single engineer. Since then, the Facebook iOS App has grown into a native ‘platform’ which supports more than 100 products, and hundreds ...
At Capital One, building beautiful, rich, and performant mobile applications for iOS and Android is essential to providing a best-in-class experience for our customers and delivering on our mission to change banking for good. However, ...
We’ve all heard much about QUIC in the past few years, and a lot has been made of its performance benefits for HTTP/3. For some of us however, HTTP/3 was always just the beginning, just the vehicle for us to get QUIC out into the ...
In a typical CDN architecture the caching tier is fronted by a load-balancing tier; response content flows from the cache to the requester through the load-balancer. With this architecture extra I/O, CPU cycles and intra-cluster ...
Transfers in high-BDP links incur a startup delay for congestion control to probe the bandwidth of the underlying link. The impact of this delay is inversely proportional to the size of the transfer since small transfers may repeatedly ...
Nestled between transport protocols (TCP, UDP, QUIC) and application protocols (HTTP, etc.) is a layer few are familiar with. Layer 4¾ sits hiding in plain sight, often only being glimpsed during curious events that raise its ...
A key feature of HTTP/3 over QUIC is the ability to send a request in the first flight with the ClientHello. 0-RTT in IETF QUIC is notably more complex than gQUIC, with multiple packet number spaces and a limit on the amplification ...
A talk about two specific DC transport tuning initiatives: (a) handling sustained congestion in the network and (b) tackling bursts in the network. Covers the motivation, implementation overview, wins, and lessons learnt for both initiatives.
We will share the design, implementation, and production experience of a BPF-based platform used to tune the network transport across millions of servers at Meta.
We will demonstrate a performant and novel approach to performing NAT, that uses a unique transition mechanism utilizing a new flag introduced to the seccomp() system call, to intercept egress connect calls to opportunistically use a ...
The Wide Area Network (WAN) connects many datacenter (DC) regions and hundreds of Points of Presence (POPs) of Meta. The WAN resource is shared by several high network demand services at Meta. The network must be built for peak demand ...
This talk breaks down stage-by-stage requirements and challenges for online prediction and fully automated, on-demand continual learning. We’ll also discuss key design decisions a company might face when building or adopting a machine ...
At Meta, we have developed multiple pieces of real-time data processing infrastructure, such as Puma, Stylus, and Turbine (SIGMOD ’16 and ICDE ’20). As Meta grows, the need for real-time data has grown way beyond traditional data ...
Machine learning is at the heart of Pinterest and is powered by large-scale ML training log collection. To solve the problem of cost-efficient data ingestion and transportation at Pinterest, we developed MemQ, a PubSub system that ...
ML generates significant value for Meta’s infrastructure, tools, products, and users. It drives a varied set of insights; from end-user products such as recommendations and feeds on Facebook and Instagram, to infrastructure insights ...
Data Infrastructure has evolved in the last 15 years from Hadoop’s batch system, to streaming systems like Spark and Kafka, and now to real-time systems like Rockset and ClickHouse. Automatic decision making based on massive data ...
The problem of deep learning and building large scale systems for production is not just one of model training, but data preprocessing as well. At production scale, just the data loading and processing part of the system can cause ...
“Machine Learning models are only as good as the data that was used to train them. Datasets are often plagued with problems such as quality, discoverability, and undesirable social biases. As data and modeling tools are becoming ...
We’re truly living under the rule of algorithms: our day-to-day activities, from news consumption and job search to mortgage financing, are increasingly being decided by algorithms. Most of these algorithms are AI-based and are ...
Q&A | Moderated by Francois Richard. Featuring Yuri Grinshteyn, Jie Huang, Christopher Bunn, Osama Abuelsorour, Amr Mahdi, Jason Flinn & Arushi Aggarwal
We will describe Owl, a new system for high-fanout distribution of large data objects to hosts in Meta’s private cloud. Owl distributes over 700 petabytes of data per day to millions of client processes. It has improved download ...
Southpaw is a load balancing, scaling, and QoS management system for compute-heavy inferencing services. It takes the approach of abstracting service capabilities into tokens and worklanes, where clients are granted tokens that give ...
Engineers at Meta run thousands of services across millions of machines, and those services all have similar needs that can’t be managed by hand: configuration, deployment, monitoring, routing, orchestration, security. To solve the ...
Shrinking Production Incidents details an organized approach for reducing the overall impact of production outages. Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem ...
Highly Available and Strongly Consistent Storage Service Using Chain Replication | Kumar Mrinal & Binbin Lu – In this talk, we present Dumbo – a simple, reliable, highly available, low dependency object storage system ...
Privacy is core to Meta engineering culture, and one of our fundamental principles is data minimization. We strive to collect and create the minimum amount of data required to provide service. One critical space we’ve identified across ...
Cosmos is the exabyte-scale big data platform at Microsoft, and SCOPE is its main analytics engine. SCOPE and Cosmos support ETL pipelines, decision support systems, and machine learning pipelines. Applications range from simple ...
AI models drive several Meta products, such as News Feed, Ads, IG Reels, and language translation, to name a few. Our ranking models consume massive datasets to continuously improve user experience on our platform. In this talk, we discuss our ...
After almost two years in the making, Apache Cassandra 4.0 is here. With a focus on performance and stability, it is full of interesting features. This talk takes you on a tour of the new features and performance improvements. From ...
Since 2016, we’ve built, deployed, and scaled a new BLOB storage platform at Meta, called Manifold. Manifold builds on existing BLOB storage infrastructure, but provides a richer, higher-level, general purpose API, and thereby enables ...
The unrelenting growth of the memory needs of emerging data center applications, along with ever-increasing cost and volatility of DRAM prices, has led to DRAM being a major infrastructure expense. Alternative technologies, such as ...
Holographic Video Calling – Nitin Garg: During COVID, the importance of video calling grew as people were stuck and separated from their family and friends. But the 2D experience falls short of making you feel present in the same ...
Spatial Communications at Scale in Virtual Environments – Paul Boustead: Talking with a group of friends face-to-face can be very engaging, with fast-paced turn-taking and overlapping conversations. You can even have such a ...
RTC3 – Justin Uberti: The real-time communications industry has evolved rapidly since the release of Skype in 2003, and saw unprecedented growth during the COVID-19 pandemic. This talk will look at the trends of the last 20 years ...
Developing Machine Learning Based Speech Enhancement Models for Teams and Skype – Ross Cutler: Microsoft Teams and Skype are used daily by hundreds of millions of users, and their usage has increased significantly since the ...
Can AI Disrupt Speech Compression? – Jan Skoglund: AI and deep learning have radically advanced many speech and audio processing applications. For example, we have all experienced improvements in speech recognition and synthesis in ...
AV1 Encoder for RTC – Marco Paniconi: In this presentation we discuss the various features and techniques that make the libaom AV1 encoder suitable for RTC applications: from encoding tool selection to reducing complexity of the ...
AV1 for RTC: Current and Future – Zoe Liu: In this talk, we will mainly focus on the state-of-the-art AV1 software encoding capability for its deployment in RTC use cases, taking our Aurora1 AV1 as an instantiation. RTC in essence ...
Making Meta RTC Audio More Resilient – Andy Yang: The users of Meta RTC products experience a very diverse set of network conditions, some of which may be far from perfect. In this presentation, we are going to cover the following ...
Private Calling at WhatsApp – Xi Deng: WhatsApp’s mission is to connect the world privately by designing a product that’s simple and private. Privacy and security are in our DNA. In this presentation, we are going to talk ...
Group Call End-to-End Encryption and the Challenges of Encrypting Large Calls – Abo-Talib Mahfoodh: Meta helps billions of users connect daily by providing real time communication services. Group call is one of these services were ...
RTC @Scale, Resilience and Encryption Session – Live Q&A with Andy Yang, Xi Deng, and Abo-Talib Mahfoodh
Efficient software and hardware failure remediations are the foundations for sustaining high fleet availability at large-scale environments such as Meta. In this talk, we will describe the general architecture that we use to maximize ...
SLIs (Service Level Indicators) and SLOs (Service Level Objectives) are industry-standard concepts to measure the long-term reliability of systems. In this presentation, we are going to talk about SLICK, the central SLO tracking ...
In this talk we present how we trained a 530B parameter language model on a DGX SuperPOD with over 3,000 A100 GPUs and a high speed Infiniband interconnect, and how we can scale to even larger models. We explore three types of ...
Meta uses a strongly consistent distributed log storage system to broadcast updates in graphs, deliver signals to ML training pipelines, and collect data for analytics. All of these cases require the underlying log system to be highly ...
At Meta, a large part of our data is ephemeral in nature, such as Instagram or Meta Stories, which need to be deleted after a specific time regardless of the action taken by the user. This is sometimes referred to as Time to Live (TTL). ...
Live Panel – “The Importance of Audio Today”. Recently, Audio has been front and center with the emergence of new audio-only products and experiences and many new audio-focused investments to enhance video viewing. ...
This presentation will highlight the latest improvements of the VOD-targeted high-latency Constant Rate Factor (CRF) and Variable Bit Rate (VBR) modes of the SVT-AV1 encoder. It will first present the latest SVT-AV1 cycles-quality ...
In this talk, we will discuss the current state in terms of bitrate/quality and complexity of Two Orioles’ Eve video encoder for the VP9 & AV1 video codecs. VP9 provides meaningful quality improvements over H.264 with a mature ...
The video quality of User Generated Content (UGC) is extremely difficult to wrangle due to the high diversity of its content and quality. UGC brings new challenges to how we have traditionally measured and assessed video quality. Most ...
AVQT, short for Advanced Video Quality Tool, is a macOS based command line tool which estimates perceptual video quality of compressed videos that might contain video coding and scaling artifacts. Utilizing the AVFoundation framework, ...
Like the rest of the video world, Facebook Video has grown significantly year over year. While we celebrate the growth rate, we are also concerned about the resource consumption needed to support that growth, which became worse during COVID. ...
Facebook and user-generated content (UGC) platforms encode videos at “billion-scale” and deliver them worldwide to a variety of devices (Mobile/Laptop/TV) across different networks. The popularity of UGC videos can vary widely ranging ...
A/B testing on video isn’t just about tweaking recommendations or picking the perfect thumbnail. Every aspect of video benefits from rapid experimentation including the infrastructure – streaming algorithms, codecs, bitrates, caching ...