Improving Audio Quality for Calling across Meta’s Family of Apps

Designed for engineers who develop and manage large-scale real-time communication (RTC) systems serving millions of people. The operation of large-scale RTC systems has always involved complex engineering challenges and has attracted much attention in recent months given the explosive growth of RTC in these unprecedented times.

Sriram Srinivasan

Hoang Do

TOPIC: Mobile, Video and Web

@SCALE SERIES: RTC @Scale

TYPE: ARTICLE

YEAR: 2024

Introduction

We are excited to announce the rollout of Beryl, our new, advanced, echo-and-noise-suppression solution to improve audio quality when making calls using WhatsApp, Messenger, Instagram, and Facebook on Android devices—a rollout benefiting billions of users. Following the rollout in December last year, we observed a significant reduction in user complaints about echo, confirming the 25% reduction in echo and 30% improvement in double-talk (or interruptibility) we observed in our lab tests.

The Echo Problem

Have you ever heard your own voice echo when you are on a call? Or heard distorted audio that mysteriously follows the cadence of your speech whenever you talk? This is the well-known problem of acoustic echo. For example, when Alice calls Bob, Alice’s voice plays out of the loudspeaker in Bob’s phone. Bob’s mic picks up Alice’s voice, and in the absence of any signal processing, this audio signal gets sent back to Alice, who hears her voice echoed back. Calling apps include some form of acoustic echo cancellation (AEC) that cancels out Alice’s voice from Bob’s mic input. Apps may either implement AEC in software or leverage the phone’s built-in AEC when one is available. Not all AECs are created equal, however, and when they fail, you hear your own echo.

Figure 1: The echo problem

In the above example, one way to suppress echo is to attenuate Bob’s microphone signal whenever Alice is talking. While this ensures Alice will never hear her own echo, this also means Bob can never interrupt Alice when she is talking—the well-known walkie-talkie effect. The goal of a good AEC is to suppress echo and allow both participants to talk at the same time (known as “double talk,” in common AEC parlance).
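The attenuation approach described above can be sketched in a few lines. This is only an illustration of the walkie-talkie effect, not part of Beryl; the function name, the energy threshold, and the frame values are all assumptions made for the example.

```python
def half_duplex_gate(mic_frames, far_end_frames, threshold=0.01, attenuation=0.0):
    """Attenuate each microphone frame whenever the far end (Alice) is active.

    threshold and attenuation are illustrative values, not tuned parameters.
    """
    out = []
    for mic, far in zip(mic_frames, far_end_frames):
        # Mean energy of the far-end frame decides whether Alice is talking.
        far_energy = sum(s * s for s in far) / len(far)
        gain = attenuation if far_energy > threshold else 1.0
        out.append([gain * s for s in mic])
    return out


# Frame 0: only Bob talks. Frame 1: Alice talks while Bob tries to interrupt.
mic = [[0.5, -0.5, 0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]
far = [[0.0, 0.0, 0.0, 0.0], [0.8, -0.8, 0.8, -0.8]]
gated = half_duplex_gate(mic, far)
```

In frame 1 the gate removes Alice’s echo, but it also silences Bob’s interruption, which is exactly the loss of double talk that a good AEC has to avoid.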

Across our family of apps (FoA), including WhatsApp, Messenger, Instagram, and Facebook, for calls on mobile phones where built-in hardware AEC does not provide satisfactory quality, we previously used an algorithm referred to as AECm from the open-source WebRTC stack. In contrast to a computationally heavier version from WebRTC called AEC3 that runs on browsers and desktop apps, AECm is a lightweight algorithm suitable for mobile phones.

Echo cancellation is a complex problem affected by a number of factors such as microphone and loudspeaker characteristics, how close they are to each other, the nature of the material that separates them, and the coupling between them. Other factors, such as loudspeaker volume, how reverberant the user’s acoustic environment is, and the relative volume of the loudspeaker signal and the local talker’s voice, also play an important role. At our scale, where billions of users engage in calling across thousands of device models in varied acoustic environments, we found that WebRTC AECm was not providing good enough quality across our diverse mobile user base.

Beryl, Our Lightweight Echo Removal Solution

There is a lot of excitement in academia and our industry around machine learning (ML)-based echo removal. At Meta, we realized the value of ML early. On devices that can afford higher computational complexity, we previously launched an ML-based approach that suppresses echo, preserves double talk, and also suppresses various types of background noise. This solution is available for all our Messenger desktop users on Windows and Mac. At the same time, we also needed a baseline solution with wide reach that is lightweight in terms of CPU and memory resources and yet provides significantly better quality than the current state-of-the-art solution, even on the lowest-end devices. To solve this problem for our users, we invested in writing our own in-house, lightweight AEC, codenamed Beryl.

Below are two short recordings from live calls on Messenger for Android, one using (a) WebRTC AECm and the other using (b) Beryl AEC. Both were made with two phones that are popular among our users and that lack hardware AEC. Each recording is a stereo audio file, where the left channel is the AEC output from one phone, and the right channel is the far-end reference audio signal coming from the other. One can hear that the Beryl AEC output is free of the residual echo and noise present in the AECm output.

A call using WebRTC AECm

A call using Beryl AEC

Figure 2: Beryl architecture and building blocks

Figure 2 illustrates the main signal-processing blocks that form the core of Beryl. The loudspeaker reference signal r(n), which contains Alice’s voice, and the microphone signal z(n), which contains a mix of Bob’s voice, background noise, and the loudspeaker signal, are fed as inputs to a delay estimation and alignment block. This ensures that Alice’s voice, which appears directly in r(n) and as echo in z(n), is time-aligned across the two inputs. A notch filter removes any hum present due to powerline interference, after which the signal is converted to the frequency domain for subband processing. Beryl supports signals sampled at a rate of up to 48 kHz. Beryl Lite, intended for low-compute mobile devices, processes the signal in frames of 10 milliseconds (ms) with an algorithmic latency of 10 ms. Beryl Full for desktops and high-compute mobile devices uses 5-ms frames with an algorithmic latency of 15 ms. Beryl also supports multi-channel echo cancellation (multiple capture and playback channels), which is crucial for scenarios such as stereo playback and capture.
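To make the delay estimation and alignment step concrete, here is a minimal sketch that finds the lag maximizing the cross-correlation between the reference and the microphone signal, then delays the reference by that amount. The article does not say how Beryl estimates delay, so the brute-force time-domain search, the function names, and the synthetic signals are assumptions for illustration only.

```python
import random


def estimate_delay(reference, mic, max_delay):
    """Return the lag in samples that maximizes cross-correlation of reference and mic."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        n = min(len(reference), len(mic) - lag)
        score = sum(reference[i] * mic[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_reference(reference, lag):
    """Delay the reference by `lag` samples so it lines up with the echo in the mic."""
    return [0.0] * lag + reference[: len(reference) - lag]


# Synthetic test signal: the mic hears the reference 7 samples late, scaled by 0.6.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(200)]
true_delay = 7
mic = [0.0] * true_delay + [0.6 * s for s in reference[:-true_delay]]

estimated = estimate_delay(reference, mic, max_delay=20)
aligned = align_reference(reference, estimated)
```

A production system would have to track a delay that drifts over time and would typically search in a cheaper domain than raw samples; the sketch only shows the core idea of lining up r(n) with its echo in z(n).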

In both its Lite and Full modes, Beryl’s improved double talk is largely due to its support for a linear adaptive filter. In contrast, while WebRTC’s AEC3 leverages a linear filter, AECm does not, which negatively affects double-talk quality. Beryl makes minimal assumptions about critical aspects such as the relative delay between the loudspeaker reference and microphone signal, coupling between the loudspeaker and the microphone, and background noise. This is important for generalizing over a large set of unknown devices, which is very different from tuning an AEC for a single known device. Beryl assumes that the echo path can vary dynamically and has mechanisms to quickly adapt to changes in the echo path. The reverb estimator module estimates the energy that is not part of the direct path between the loudspeaker and microphone, which the non-linear echo suppressor component then uses to suppress residual echoes that remain after the linear filter. Then, before the signal is converted back to the time domain and is gain-normalized, the noise suppressor removes any stationary background noise.
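For readers less familiar with linear adaptive filters, the sketch below shows a textbook normalized least-mean-squares (NLMS) echo canceller operating in the time domain. The article does not state which adaptation algorithm Beryl uses, and Beryl works on frequency-domain subbands, so NLMS, the tap count, the step size, and the synthetic echo path are all assumptions chosen for illustration.

```python
import random


def nlms_echo_canceller(reference, mic, taps=8, step=0.5, eps=1e-8):
    """Textbook NLMS: adapt a linear filter so that filtered reference matches the echo.

    Returns the residual signal (mic minus estimated echo) and the final weights.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for r, z in zip(reference, mic):
        history = [r] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        e = z - echo_estimate
        # Normalizing by the input energy keeps adaptation stable across volumes.
        norm = sum(x * x for x in history) + eps
        weights = [w + step * e * x / norm for w, x in zip(weights, history)]
        residual.append(e)
    return residual, weights


# Hypothetical echo path: the mic hears a short, filtered copy of the reference.
echo_path = [0.0, 0.5, 0.3, -0.1]
rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [
    sum(h * reference[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(reference))
]

residual, weights = nlms_echo_canceller(reference, mic)
```

Because the filter subtracts an estimate of the echo rather than attenuating the whole microphone signal, any near-end speech added to `mic` would pass through into the residual. In practice adaptation is usually slowed or frozen during double talk so that near-end speech does not disturb the weights, and a non-linear suppressor handles whatever echo the linear stage leaves behind.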

While we have described the main system components, which will be familiar to audio engineers, there are several other smaller components and ingenious optimizations that together produce high-quality output.

Evaluation and Results

For a long-term investment of this nature, it is critical to build confidence through intermediate milestones and evaluations. Before we could deploy Beryl in our apps and measure its benefits through A/B testing, we needed to evaluate it in the lab. This required building a representative test data set that reflected the diversity in device and acoustic conditions our users experience. We never collect audio data from our apps, so we had to rely on crowdsourced data collection similar to the approach taken by other companies for the ICASSP 2022 AEC Challenge. Our data set contains about 3,000 audio clips, or 40 hours of speech and music, with loudspeaker and microphone signals recorded from real devices covering 7 languages, and sampled at both 16 and 48 kHz. These signals were then processed through our algorithms under test. The outputs were assessed using the ITU-T P.831 protocols for evaluating AEC algorithms.

Table 1 shows the mean opinion score (MOS) results for Beryl Full and Beryl Lite for both speech and music signals under challenging conditions for echo removal, compared to the WebRTC solutions. We measured the (absence of) echo annoyance (on a 5-point scale, where higher is better) and the quality of the primary talker’s voice signal when the remote party is also talking, also known as double talk (again on a 5-point scale, where higher is better). As a majority of our user base is on mobile platforms, we are particularly excited about Beryl Lite’s 26-29% improvement in echo annoyance scores and 38% improvement in double-talk scores compared to AECm. It is important to note that absolute scores are dependent on the nature of the test data used in the evaluations.

Importantly, Beryl achieves these significant quality gains with only a modest (less than 7% relative) increase in CPU load, which was confirmed not to affect the user experience through our production A/B tests.

P.831 measure	Explanation	Beryl Full improvement over WebRTC AEC3	Beryl Lite improvement over WebRTC AECm
Echo annoyance during single talk	Bob is silent and Alice is speaking. Alice should not hear herself echoed back.	37%	26%
Echo annoyance during double talk	Both Bob and Alice are speaking. Neither should hear themselves echoed back.	39%	29%
Near-end quality during double talk	Both Bob and Alice are speaking. Bob’s voice comes across clearly to Alice (and vice versa).	14%	38%

Table 1: Subjective quality of Beryl Lite and Full in comparison with WebRTC solutions
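The percentages in Table 1 compare MOS values between two systems. One common way to express such a comparison is as a gain relative to the baseline score, sketched below. The article does not state the exact formula used, and the scores in the example are hypothetical values picked only to show the arithmetic, not results from the evaluation.

```python
def relative_improvement(baseline_mos: float, candidate_mos: float) -> float:
    """Percentage gain of the candidate MOS over the baseline MOS (assumed formula)."""
    if baseline_mos <= 0:
        raise ValueError("baseline MOS must be positive")
    return 100.0 * (candidate_mos - baseline_mos) / baseline_mos


# Hypothetical scores on the 5-point scale, chosen only to illustrate the arithmetic.
hypothetical_baseline = 3.0
hypothetical_candidate = 3.75
gain = relative_improvement(hypothetical_baseline, hypothetical_candidate)  # 25.0
```

Under this definition the same absolute MOS difference yields a larger percentage when the baseline is low, which is one reason absolute scores and the test data behind them matter when reading such figures.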

Once the algorithms were finalized and integrated into our apps, we performed an A/B test in production to validate our lab results before rolling Beryl out to all our users. We observed a significant improvement in our engagement metrics as well as a significant reduction in user-reported echo problems. Beryl is now rolled out to all calls across our major apps and powers billions of calling minutes each day.

Team

We created Beryl in collaboration with an external vendor, SMPL, and would like to thank everyone who contributed to this project. At Meta, our team includes Hoang Do, Sureshbabu Ramalingam, Puneet Rana, Yun Li, George Gao, James Luan, Bikash Agarwalla, and Sriram Srinivasan. At SMPL, we worked with Phil Hetherington, Shree Paranjpe, and Karsten Sorensen.
