This talk breaks down stage-by-stage requirements and challenges for online prediction and fully automated, on-demand continual learning. We’ll also discuss key design decisions a company might face when building or adopting a machine ...
At Meta, we have developed multiple real-time data processing systems such as Puma, Stylus and Turbine (SIGMOD ’16 and ICDE ’20). As Meta has grown, the need for real-time data has grown well beyond traditional data ...
Machine learning is at the heart of Pinterest and is powered by large-scale ML training log collection. To solve the problem of cost-efficient data ingestion and transportation at Pinterest, we developed MemQ, a PubSub system that ...
ML generates significant value for Meta’s infrastructure, tools, products, and users. It drives a varied set of insights, from end-user products such as recommendations and feeds on Facebook and Instagram, to infrastructure insights ...
Data infrastructure has evolved in the last 15 years from Hadoop’s batch system, to streaming systems like Spark and Kafka, and now to real-time systems like Rockset and ClickHouse. Automatic decision making based on massive data ...
The problem of deep learning and building large scale systems for production is not just one of model training, but data preprocessing as well. At production scale, just the data loading and processing part of the system can cause ...
“Machine Learning models are only as good as the data that was used to train them. Datasets are often plagued with problems such as quality, discoverability, and undesirable social biases. As data and modeling tools are becoming ...
We’re truly living under the rule of algorithms: our day-to-day activities, from news consumption and job search to mortgage financing, are increasingly decided by algorithms. Most of these algorithms are AI-based and are ...
Q&A | Moderated by Francois Richard. Featuring Yuri Grinshteyn, Jie Huang, Christopher Bunn, Osama Abuelsorour, Amr Mahdi, Jason Flinn & Arushi Aggarwal
We will describe Owl, a new system for high-fanout distribution of large data objects to hosts in Meta’s private cloud. Owl distributes over 700 petabytes of data per day to millions of client processes. It has improved download ...
Southpaw is a load balancing, scaling, and QoS management system for compute-heavy inferencing services. It takes the approach of abstracting service capabilities into tokens and worklanes, where clients are granted tokens that give ...
Engineers at Meta run thousands of services across millions of machines, and those services all have similar needs that can’t be managed by hand: configuration, deployment, monitoring, routing, orchestration, security. To solve the ...
Shrinking Production Incidents details an organized approach for reducing the overall impact of production outages. Attendees can expect to learn how to prioritize reliability-related engineering tasks based on incident postmortem ...
Highly Available and Strongly Consistent Storage Service Using Chain Replication | Kumar Mrinal & Binbin Lu – In this talk, we present Dumbo – a simple, reliable, highly available, low dependency object storage system ...
Privacy is core to Meta engineering culture, and one of our fundamental principles is data minimization. We strive to collect and create the minimum amount of data required to provide service. One critical space we’ve identified across ...
Cosmos is the exabyte-scale big data platform at Microsoft, and SCOPE is its main analytics engine. SCOPE and Cosmos support ETL pipelines, decision support systems, and machine learning pipelines. Applications range from simple ...
AI models drive several Meta products, such as News Feed, Ads, IG Reels, and language translation, to name a few. Our ranking models consume massive datasets to continuously improve user experience on our platform. In this talk, we discuss our ...
After almost two years in the making, Apache Cassandra 4.0 is here. With a focus on performance and stability, it is full of interesting features. This talk takes you on a tour of the new features and performance improvements. From ...
Since 2016, we’ve built, deployed, and scaled a new BLOB storage platform at Meta, called Manifold. Manifold builds on existing BLOB storage infrastructure, but provides a richer, higher-level, general purpose API, and thereby enables ...
The unrelenting growth of the memory needs of emerging data center applications, along with ever-increasing cost and volatility of DRAM prices, has led to DRAM being a major infrastructure expense. Alternative technologies, such as ...
Holographic Video Calling | Nitin Garg – During COVID, the importance of video calling grew as people were stuck and separated from their family and friends. But the 2D experience falls short of making you feel present in the same ...
Spatial Communications at Scale in Virtual Environments | Paul Boustead – Talking with a group of friends face-to-face can be very engaging, with fast-paced turn-taking and overlapping conversations. You can even have such a ...
RTC3 | Justin Uberti – The real-time communications industry has evolved rapidly since the release of Skype in 2003, and saw unprecedented growth during the COVID-19 pandemic. This talk will look at the trends of the last 20 years ...
Developing Machine Learning Based Speech Enhancement Models for Teams and Skype | Ross Cutler – Microsoft Teams and Skype are used daily by hundreds of millions of users, and their usage has increased significantly since the ...
Can AI Disrupt Speech Compression? | Jan Skoglund – AI and deep learning have radically advanced many speech and audio processing applications. For example, we have all experienced improvements in speech recognition and synthesis in ...
AV1 Encoder for RTC | Marco Paniconi – In this presentation, we discuss the various features and techniques that make the libaom AV1 encoder suitable for RTC applications: from encoding tool selection to reducing complexity of the ...
AV1 for RTC: Current and Future | Zoe Liu – In this talk, we will focus mainly on state-of-the-art AV1 software encoding capability for deployment in RTC use cases, taking our Aurora1 AV1 as an example. RTC in essence ...
Making Meta RTC Audio More Resilient | Andy Yang – The users of Meta RTC products experience a very diverse set of network conditions, some of which may be far from perfect. In this presentation, we are going to cover the following ...
Private Calling at WhatsApp | Xi Deng – WhatsApp’s mission is to connect the world privately by designing a product that’s simple and private. Privacy and security are in our DNA. In this presentation, we are going to talk ...
Group Call End-to-End Encryption and the Challenges of Encrypting Large Calls | Abo-Talib Mahfoodh – Meta helps billions of users connect daily by providing real-time communication services. Group calling is one of these services where ...
RTC @Scale, Resilience and Encryption Session – Live Q&A with Andy Yang, Xi Deng, and Abo-Talib Mahfoodh
Efficient software and hardware failure remediations are the foundation for sustaining high fleet availability in large-scale environments such as Meta. In this talk, we will describe the general architecture that we use to maximize ...
SLIs (Service Level Indicators) and SLOs (Service Level Objectives) are industry-standard concepts to measure the long-term reliability of systems. In this presentation, we are going to talk about SLICK, the central SLO tracking ...
In this talk we present how we trained a 530B parameter language model on a DGX SuperPOD with over 3,000 A100 GPUs and a high speed Infiniband interconnect, and how we can scale to even larger models. We explore three types of ...
Meta uses a strongly consistent distributed log storage system to broadcast updates in graphs, deliver signals to ML training pipelines, and collect data for analytics. All of these cases require the underlying log system to be highly ...
At Meta, a large part of our data is ephemeral in nature, such as Instagram or Meta Stories, which need to be deleted after a specific time regardless of any action taken by the user. This is sometimes referred to as Time to Live (TTL). ...
Live Panel – “The Importance of Audio Today”. Recently, audio has been front and center, with the emergence of new audio-only products and experiences and many new audio-focused investments to enhance video viewing. ...
This presentation will highlight the latest improvements of the VOD-targeted high-latency Constant Rate Factor (CRF) and Variable Bit Rate (VBR) modes of the SVT-AV1 encoder. It will first present the latest SVT-AV1 cycles-quality ...
In this talk, we will discuss the current state in terms of bitrate/quality and complexity of Two Orioles’ Eve video encoder for the VP9 & AV1 video codecs. VP9 provides meaningful quality improvements over H.264 with a mature ...
The video quality of user-generated content (UGC) is extremely difficult to wrangle because of its highly diverse content and quality. UGC brings new challenges to how we have traditionally measured and assessed video quality. Most ...
AVQT, short for Advanced Video Quality Tool, is a macOS-based command-line tool that estimates the perceptual quality of compressed videos that might contain video coding and scaling artifacts. Utilizing the AVFoundation framework, ...
Like the rest of the video world, Facebook Video has grown significantly year over year. While we celebrate the growth rate, we are also concerned about the resource consumption needed to support that growth, which worsened during COVID. ...
Facebook and user-generated content (UGC) platforms encode videos at “billion-scale” and deliver them worldwide to a variety of devices (Mobile/Laptop/TV) across different networks. The popularity of UGC videos can vary widely, ranging ...
A/B testing on video isn’t just about tweaking recommendations or picking the perfect thumbnail. Every aspect of video benefits from rapid experimentation including the infrastructure – streaming algorithms, codecs, bitrates, caching ...
Malicious synthetic media – both deepfakes and cheapfakes – are rising in prevalence and importance. End users are rapidly losing trust in media, and their ability to tell authentic media from inauthentic has greatly diminished. This ...
Understanding video content has been a focus for video-sharing platforms. It is one of the most important driving forces for the growth in distribution, discovery, user experience and monetization. Instream video understanding is the ...
Two years ago, iStreamPlanet set out to build a cloud-native software transcoder with the reliability and feature set to support some of the highest profile live channels and events in the world. Some of our goals included: 4+ 9’s of ...
Reducing end-to-end streaming latency is critical for HTTP-based live video streaming. There are currently two new technologies in this domain: Low-Latency HTTP Live Streaming (LL-HLS) and Low-Latency Dynamic Adaptive Streaming over ...
Serving live videos with high reliability is challenging, not only from the perspective of deploying improvements on top of a distributed system but also from the perspective of defining correct measurements to capture reliability gaps ...
We present how Facebook’s unified Continuous Deployment (CD) system, Conveyor, powers safe and flexible service deployment across all services at Facebook. Conveyor enables service owners to build highly customized deployment ...
The cloud is becoming one of the most attractive ways for enterprises to store, analyze, and get value from their data, but building and operating a data platform in the cloud has a number of new challenges compared to traditional ...
Facebook Ordered Queue Service (FOQS) is a distributed priority queue service that powers hundreds of services and products across the Facebook stack. Facebook users have come to rely on its services to remain connected to their ...
Confluent Inc provides cloud based data stream platforms based on Apache Kafka. Running an open source product like Kafka on the public cloud offerings of Amazon, Google, and Microsoft offers an interesting array of challenges. This ...
Power outages cause the majority of unplanned server downtime in Facebook data centers. During a power outage, thousands of servers can go offline simultaneously for several hours, which can lead to service degradations. At Facebook, ...
BigSpring is a mobile-first platform for lifelong skilling with measurable ROI. We use GraphQL to power our services. We would love to talk about how we use Jest to integration-test our resolvers and other business logic built in our ...
Developing at speed and scale across Facebook’s many services requires testing frameworks that help developers iterate on features quickly and with minimal friction, while helping to catch bugs early. Learn why we’ve built our own ...
Attribution of reliability in a microservice architecture can be solved, and has been solved, in very different ways due to how services are cataloged across the industry. Our hypothesis at Lyft was that service catalogs can become ...
In 2013, eight years ago, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully managed, petabyte-scale cloud data warehouse solution. Amazon Redshift made it simple and ...
Transitive Resource Accounting (TRA) is a system that builds on top of Facebook’s distributed tracing platform, Canopy, with the goal of capturing end-to-end request cost metrics and attributing them back to the originating caller. This ...
At Twitter, hundreds of thousands of microservices emit important events triggered by user interactions on the platform. The Data Platform team has the requirement to aggregate these events by service type and generate consolidated ...
Facebook is undergoing a massive design shift in capacity management and service placement to scale the efficiency of our datacenter resources. At the core of this shift is the Resource Allowance System (RAS) that continuously ...
The Facebook cloud supports a variety of workloads including those which are CPU intensive, memory bound, I/O bound, latency sensitive, or a combination of these, on hardware that ranges from smaller single socket servers to load ...
Application specific hardware platforms play a crucial role in meeting the growing latency and compute demands of workloads like deep learning, content understanding and video encoding. However, it is challenging to operate these ...
BPF (eBPF) tracing is the superpower that can analyze everything, helping you find performance wins, troubleshoot software, and more. But with many different front-ends and languages, and years of evolution, finding the right starting ...
In this talk, we share some of the most exciting achievements of Flink at Alibaba in recent years, including two main topics: one is the architecture evolution of stream-batch unification; the other is the recent efforts to improve ...
Tectonic is Facebook’s exabyte-scale, datacenter-wide distributed filesystem. Prior to Tectonic, Facebook’s storage infrastructure consisted of a constellation of smaller, specialized storage systems. Blob storage was spread across ...
Alibaba Cloud offers a comprehensive set of storage services, including Object Storage Service (OSS), File Storage Service (NAS) and NoSQL Tablestore with high durability, high availability, high scalability and strong consistency. All ...
Optimus, our spare capacity leasing system, coordinates capacity allocations on millions of machines to improve global capacity utilization and meet fast growing business needs. Within Facebook’s infrastructure, spares are ...
Uber infrastructure broadly supports 3 kinds of workloads: stateless microservices, big data (batch) and stateful, each running on its own hardware silo. Morcor aims to reduce the cost of infrastructure through co-location of stateless ...
Azure Kubernetes Service (AKS) manages Kubernetes clusters on behalf of customers. AKS stays agnostic to the customer workload and manages the accessibility, performance, and reliability of these clusters without requiring full ...
This session will share real-world lessons from reliability engineering work on the Exposure Notifications Server, a project from Google and Apple to help slow the spread of COVID-19. The work from Google SRE ...
William previously worked at Netflix, and this presentation will highlight some of the strategies he used while working there. He has the permission of Netflix to discuss them at this conference. As companies grow and the number of ...
At AWS, we build systems using a variety of complementary strategies for maintaining predictable, consistent performance in the face of overload. In this talk, we describe techniques such as implementing layers of protection, ...
Facebook is made up of hundreds of heterogeneous services in geographically distributed data center regions. To run reliably, it is crucial to provide sufficient capacity for all subsystems and services. However, understanding ...
We will be hosting a talk about our work on Virtualizing Consensus In Delos For Rapid Upgrades And Happy Engineers during our virtual Systems @Scale event at 11am PT on Wednesday, March 17th, followed by a live Q&A session. Please ...
We will be hosting a talk about our work on FlightTracker: Social graph consistency at scale during our virtual Systems @Scale event at 11am PT on Wednesday, March 17th, followed by a live Q&A session. Please submit any questions ...
Welcome to the third week of Systems@Scale – Spring 2021, Virtual Edition – featuring recorded sessions & Live Q&As with Maxim Fateev, Girish Joshi, and Dan Shiovitz.
We will be hosting a talk about our work on Workflows@Facebook: Powering Developer Productivity And Automation At Facebook Scale during our virtual Systems @Scale event at 11am PT on Wednesday, March 10th, followed by a live Q&A ...
Welcome to the second week of Systems@Scale – Spring 2021, Virtual Edition – featuring recorded sessions & Live Q&As with Chidambaram Muthu, Dan Danaila, and Sazzala Reddy.
We will host a talk about our work on Optimizing video storage via Semantic Replication during our virtual Systems @Scale event at 11am PT on Wednesday, March 3rd, followed by a live Q&A session. Please submit any questions you may ...