Re-Architecting the Mobile Network Stack @ Scale

Mobile @Scale is an invitation-only technical conference for engineers building mobile software and services that serve millions or even billions of people.

Rohil Verma

@SCALE SERIES: Mobile @Scale

YEAR: 2024

Additional Author: Amir Livneh

Meta’s user base has grown to billions worldwide. Meta’s mobile apps are used across network types, operating systems, and hardware, and we have found that standard mobile-networking components could not deliver the performance or iteration speed that we needed. We chose to build out an in-house network stack for mobile devices in order to deliver the best possible user experience and increase developer velocity.

Meta’s first in-house network stack met these goals, delivering user-experience improvements and protocol innovations. After nearly a decade in operation, however, we realized that the stack’s large binary size was limiting its adoption in several apps. We decided to rebuild the stack from the ground up, aiming to deliver improved performance with a fraction of the binary size and a vastly simpler library. We took on this big bet both to support a user base that was connecting from increasingly diverse network conditions and to reverse a trend of degrading developer velocity.

In this post, we will cover:

the motivation and history of Meta’s in-house network stack,
the challenges faced and design choices made while re-architecting the network stack,
the networking functionality that contributed to improved user experience, and
the impact of the new stack and what we plan next.

Why Meta needs its own network stack

Today, Meta’s mobile apps are used across networks ranging from 2G cellular networks in developing countries to 10 Gbps fiber in developed countries and on devices ranging from the latest iPhone to inexpensive devices running Android 5. To provide the best security and user experience for this diverse user base, most Meta apps use a custom network stack instead of relying on the built-in APIs provided by Android and iOS.

A custom network stack allows us to:

experiment with and introduce optimizations to improve user experience, including taking advantage of properties that are unique to Meta’s network infrastructure.
release bug fixes on a weekly basis instead of relying on OS updates. This is particularly important since many users do not regularly update their OS versions.
constantly improve observability so we can troubleshoot issues and identify opportunities for improvement.

First generation: Porting the server-side network stack to mobile

The first custom mobile-network stack built by Meta was based on the server-side network stack. Its main components were Proxygen, mvfst, and Folly. Although it successfully met our goals for a custom stack, its binary footprint was large, at over 2 MB. The size was due to C++ features such as exceptions and templates, and to the stack’s server-side roots: its design prioritized scale over binary footprint.

Meta’s apps make a constant effort to keep the app size small for a number of reasons, including to:

remain below the limit imposed by app stores to make the app downloadable over cellular networks,
make the app faster to download and update, and
minimize app start time.

As a result of its large binary size, a number of Meta apps chose not to bundle the custom stack. Their concern was that users with capped data plans or on low-bandwidth networks would struggle to download or upgrade a larger app. They compromised by using OS-built-in stacks or open-source alternatives, trading off performance for size.

Second generation: A mobile-first network stack

We set ourselves a goal to make the custom stack lightweight enough that all Meta apps would be able to use it. Based on discussions with the app teams, we decided on 600 KB (before compression) as an acceptable footprint.

Our preferred solution was to start with the battle-tested legacy stack and continue sharing code with the server, then incrementally reduce its size. We realized, however, that getting to our size goal would require drastic changes that would be challenging even without any external constraints, let alone when we have to balance the interests of the server side. We also realized there were already lightweight, open-source alternatives to the heavyweight libraries the legacy stack was using, and by using those alternatives instead of reducing the size of the large ones, we could minimize development time. We chose to base our lightweight stack on the open-source libraries ngtcp2, nghttp3, and libev.

To avoid the overhead of C++, we debated using C or a limited subset of C++ (precluding exceptions and templates, for example). We decided to use C combined with an in-house library that provides abstractions for needs such as collections, string manipulation, memory management, classing, and so on. With this choice of language, we used a combination of domain knowledge and past experimental data to select a stripped-down feature set that we believed contained no more than the most essential optimizations adding user value.

Bringing the stack to parity and beyond

At this point, we had a functional stack, including support for performance and security features such as connection pooling, DNS resolution and caching, Zstandard decompression, certificate verification, TLS support, QUIC, and many more. Yet a first test of the stack showed a significant negative impact on user engagement compared with the legacy stack, including regressions as far removed from the network stack as user app-start time. In addition to bugs, we were evidently missing important optimizations; however, we could not backtest every single feature in the legacy stack to determine which optimizations were missing.

So, which features were we missing, and how could we identify them efficiently through the many layers between product and the network stack?

*Figure 1. An illustration of the apps’ internal structure*

Methods of improvement

We primarily used five methods to improve the new stack:

Root-cause analysis: We attempted to connect a user-experience regression with lower-level metrics that could be linked to the network stack. We then identified the missing network stack feature needed to resolve the regression.
Bisection: We selectively enabled the new stack for individual features or network requests to identify requests that were correlated with regressions.
Telemetry: We added telemetry measuring the latency of the network request phases to isolate the area of the network where we were likely to be missing features.
Backtesting: We identified features present in the legacy stack that we thought were either likely to improve performance or significantly impact the lifetime of a request. We measured their impact via backtest and experimented with them in the new stack.
Local reproductions: We locally simulated user experiences, typically on degraded networks, to more rapidly identify the source of regressions. These reproductions enabled us to learn more quickly, with the trade-off of reduced precision, as they were not necessarily representative of the networking conditions under which product regressions occurred.

In practice, we usually employed a combination of these techniques to make progress in parallel, as it was challenging to connect product metrics with the network stack, telemetry could be noisy, and data collection took weeks due to the release cycle.
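The per-phase telemetry described above could be sketched roughly as follows. This is a minimal illustration, not the stack's actual instrumentation: the phase names, the `request_timeline` type, and all function names are hypothetical, and timestamps are passed in by the caller so that the sketch stays independent of any particular clock API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request phases; the post does not list the real ones. */
typedef enum {
    PHASE_DNS,
    PHASE_CONNECT,
    PHASE_TLS_HANDSHAKE,
    PHASE_REQUEST_EGRESS,
    PHASE_RESPONSE_INGRESS,
    PHASE_COUNT
} request_phase;

/* Start and end timestamps (microseconds) for each phase of one request. */
typedef struct {
    uint64_t start_us[PHASE_COUNT];
    uint64_t end_us[PHASE_COUNT];
    unsigned char started[PHASE_COUNT];
    unsigned char ended[PHASE_COUNT];
} request_timeline;

void timeline_init(request_timeline *t)
{
    assert(t != 0);
    *t = (request_timeline){0};
}

void phase_begin(request_timeline *t, request_phase p, uint64_t now_us)
{
    assert(t != 0 && p < PHASE_COUNT);
    t->start_us[p] = now_us;
    t->started[p] = 1;
}

void phase_end(request_timeline *t, request_phase p, uint64_t now_us)
{
    assert(t != 0 && p < PHASE_COUNT);
    t->end_us[p] = now_us;
    t->ended[p] = 1;
}

/* Returns the phase latency in microseconds, or -1 if the phase was not
 * both started and ended (for example, DNS served from cache). */
int64_t phase_latency_us(const request_timeline *t, request_phase p)
{
    assert(t != 0 && p < PHASE_COUNT);
    if (!t->started[p] || !t->ended[p] || t->end_us[p] < t->start_us[p]) {
        return -1;
    }
    return (int64_t)(t->end_us[p] - t->start_us[p]);
}
```

Comparing such per-phase latencies between the legacy and new stacks is one way to narrow a regression down to, say, connection setup rather than request egress.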

Below are features we added that helped improve user experience.

Product layer features

Upload-bandwidth estimation

When we first investigated an upload-quality regression, our hypothesis was that the new stack was egressing upload data more slowly, leading the product to dynamically select a lower quality, much as quality is varied for downloaded videos. We observed, however, that the new stack actually improved upload latency, which undermined the hypothesis of a network-stack bug.

Instead, we discovered that the upload product relies on a machine-learning model to select a bitrate for upload. This determination is done a few seconds prior to video upload and does not change. The model takes upload bandwidth as input, relying on a separate component to track the app-wide bandwidth. This component is dependent on measurements from the network stack. Unaware of this dependency chain, we had not implemented the network-stack feature responsible for providing measurements, leading to a small but meaningful regression to upload quality.
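To make the dependency chain concrete, here is a minimal sketch of how a network stack might feed an app-wide bandwidth tracker. The post does not describe how the measurements are produced or aggregated; the exponentially weighted moving average used here, and every name in the block, are assumptions chosen purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical app-wide upload-bandwidth tracker (EWMA is an assumption). */
typedef struct {
    double bits_per_sec; /* current smoothed estimate */
    double alpha;        /* weight of the newest sample, 0 < alpha <= 1 */
    int has_sample;      /* 0 until the first measurement arrives */
} bandwidth_estimator;

void bw_init(bandwidth_estimator *e, double alpha)
{
    assert(e != 0 && alpha > 0.0 && alpha <= 1.0);
    e->bits_per_sec = 0.0;
    e->alpha = alpha;
    e->has_sample = 0;
}

/* Called by the network stack after it has egressed `bytes` over
 * `elapsed_us` microseconds. Zero-length intervals are ignored. */
void bw_add_sample(bandwidth_estimator *e, uint64_t bytes, uint64_t elapsed_us)
{
    assert(e != 0);
    if (elapsed_us == 0) {
        return;
    }
    double sample = (double)bytes * 8.0 * 1e6 / (double)elapsed_us;
    if (!e->has_sample) {
        e->bits_per_sec = sample;
        e->has_sample = 1;
    } else {
        e->bits_per_sec = e->alpha * sample + (1.0 - e->alpha) * e->bits_per_sec;
    }
}

/* Returns the estimate in bits per second, or a negative value when the
 * stack has not reported any measurement yet. */
double bw_estimate_bps(const bandwidth_estimator *e)
{
    assert(e != 0);
    return e->has_sample ? e->bits_per_sec : -1.0;
}
```

In this sketch, a stack that never calls `bw_add_sample` leaves the estimator without data, so any consumer of the estimate (such as a bitrate-selection model) has to fall back to a default. That is the kind of silent gap the missing feature could create.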

Upload progress indicators

While investigating an upload success-rate regression, all evidence indicated that the new stack was slower to egress data. We hypothesized that slower egress led to timeout-driven failures in degraded networks. Without clear evidence indicating the root cause, we embarked on a local reproduction on a degraded network. Though we didn’t notice any significant sources of latency, we did notice a small visual bug where progress circles always appeared complete in the new stack, as illustrated in Figure 2 below.

*Figure 2. Left image: Progress indicators began nearly complete, never incrementing. Right image: Indicators gradually incremented from an empty circle.*

We traced this bug, discovering a dependency between the upload product and the network stack, and fixed it, not expecting it to meaningfully impact any metrics. To our surprise, restoring the progress indicator significantly improved the new stack’s upload success rate. Drilling in, we discovered that the improvement was driven by a reduction in upload cancellations by certain user populations, who we surmised were getting frustrated with the visual indication of a “stuck upload.”
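As an illustration of the kind of dependency involved, the sketch below shows one way a network stack could expose upload progress for a UI to render. The post does not say how progress is actually computed or reported, so the `upload_progress` type and its functions are hypothetical, and "sent" here simply means bytes the stack reports as egressed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-upload progress state kept by the network stack. */
typedef struct {
    uint64_t total_bytes; /* size of the request body */
    uint64_t sent_bytes;  /* bytes reported as egressed so far */
} upload_progress;

void upload_progress_init(upload_progress *p, uint64_t total_bytes)
{
    assert(p != 0);
    p->total_bytes = total_bytes;
    p->sent_bytes = 0;
}

/* Records that `n` more bytes were egressed, clamping at the total. */
void upload_progress_on_sent(upload_progress *p, uint64_t n)
{
    assert(p != 0);
    uint64_t remaining = p->total_bytes - p->sent_bytes;
    p->sent_bytes += (n > remaining) ? remaining : n;
}

/* Fraction in [0, 1] that a progress circle could render. */
double upload_progress_fraction(const upload_progress *p)
{
    assert(p != 0);
    if (p->total_bytes == 0) {
        return 1.0; /* nothing to upload counts as complete */
    }
    return (double)p->sent_bytes / (double)p->total_bytes;
}
```

If the stack never reports incremental progress, or reports the full size up front, an indicator driven by this fraction cannot move, which matches the "stuck upload" appearance users reacted to.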

Network-layer features

Compression of proprietary HTTP/3 headers

While investigating a navigation-latency regression, we dove into the performance of API requests. A successful navigation consists of building the UI and loading the content, with API requests determining the visible content. These requests are small and are responsible, for example, for fetching a list of videos or a user’s friends’ profile photos. Until these requests complete, no media can be downloaded, so navigation is blocked.

In telemetry, we noticed a latency regression in egressing the headers and body, and a corresponding regression in the headers’ compression ratio. We saw that while we were using QPACK to compress the request headers, the new stack’s implementation was compressing only standard HTTP headers, not proprietary ones. HTTP headers make up a small number of bytes, and custom headers even fewer; interestingly, though, custom headers are still large enough that compressing them affects navigation performance.
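The effect can be illustrated with a toy model. In QPACK (RFC 9204), common header fields can be referenced from a predefined static table, while other fields, such as proprietary ones, only shrink to a short index once they have been inserted into the connection's dynamic table. The sketch below is not the QPACK wire format and does not use any real library API: the byte costs are rough assumptions, Huffman coding and eviction are ignored, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_TABLE_CAPACITY 16

typedef struct {
    const char *name;
    const char *value;
} header_field;

/* Toy stand-in for a per-connection dynamic table. */
typedef struct {
    header_field entries[TOY_TABLE_CAPACITY];
    size_t count;
    int enabled; /* 0 models an encoder that never indexes custom headers */
} toy_dynamic_table;

static int toy_table_find(const toy_dynamic_table *t, const header_field *h)
{
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->entries[i].name, h->name) == 0 &&
            strcmp(t->entries[i].value, h->value) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Returns an approximate number of bytes needed to send `h`.
 * Assumed costs: ~1 byte for an indexed reference, and the raw
 * name/value lengths plus ~2 bytes of framing for a literal. */
size_t toy_encode_cost(toy_dynamic_table *t, const header_field *h)
{
    assert(t != NULL && h != NULL && h->name != NULL && h->value != NULL);
    size_t literal_cost = 2 + strlen(h->name) + strlen(h->value);
    if (!t->enabled) {
        return literal_cost; /* every request pays the full cost */
    }
    if (toy_table_find(t, h) >= 0) {
        return 1; /* later requests reference the stored entry */
    }
    if (t->count < TOY_TABLE_CAPACITY) {
        t->entries[t->count++] = *h; /* first use pays for the insertion */
    }
    return literal_cost;
}
```

With indexing enabled, a custom header that repeats on every API request costs its full length once and roughly a byte thereafter; with it disabled, each of those small requests carries the full header again.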

HTTP/3 egress prioritization

Investigating another navigation-latency regression, we again dove into API request performance. This time, we backtested a feature that schedules client egress according to HTTP/3 priority, which was present in the legacy stack but not the new one. We hypothesized that scheduling request bodies according to priority might improve API request latency, similar to the impact that header compression had. However, although the backtest showed that egress prioritization improved app-start performance, it did not show improved navigation performance.

As we studied the client egress implementations in both stacks, we noticed a subtle difference between the two implementations. By default, the legacy stack egressed requests simultaneously, while the new stack egressed requests sequentially. This meant that the backtest measured the impact of different urgency buckets but did not measure the impact of scheduling traffic simultaneously. In practice, since many API requests shared the same urgency bucket, the decision to egress simultaneously or sequentially was meaningful to API traffic scheduling.

We hypothesized that by egressing request data sequentially, the new stack was blocking navigation-request egress behind the egress of other API requests that had started earlier but were now non-critical to the user’s navigation experience. We modified the new stack’s default behavior to egress requests simultaneously, and we implemented egress prioritization. Egress prioritization resolved the navigation regression, improved app-start performance, and enhanced other app-experience metrics.
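A minimal sketch of this scheduling policy follows, assuming urgency values as defined by the HTTP extensible prioritization scheme (RFC 9218), where 0 is the most urgent and 7 the least. The most urgent bucket with pending data is served first, and streams within the same bucket are interleaved round-robin rather than drained one after another. The types and function names are hypothetical and do not reflect the stack's real scheduler.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;
    int urgency;          /* 0 = most urgent ... 7 = least urgent */
    size_t pending_bytes; /* request bytes still waiting to egress */
} egress_stream;

/* Picks the next stream to egress from, or returns -1 if none has data.
 * `last_idx` is the index returned by the previous call (-1 initially);
 * scanning starts just after it, so streams that share the most urgent
 * bucket take turns instead of blocking one another. */
int pick_next_stream(const egress_stream *s, size_t n, int last_idx)
{
    assert(s != NULL || n == 0);
    int best_urgency = 8; /* one past the least urgent valid value */
    for (size_t i = 0; i < n; i++) {
        if (s[i].pending_bytes > 0 && s[i].urgency < best_urgency) {
            best_urgency = s[i].urgency;
        }
    }
    if (best_urgency == 8) {
        return -1;
    }
    size_t start = (last_idx < 0) ? 0 : (size_t)last_idx + 1;
    for (size_t k = 0; k < n; k++) {
        size_t idx = (start + k) % n;
        if (s[idx].pending_bytes > 0 && s[idx].urgency == best_urgency) {
            return (int)idx;
        }
    }
    return -1; /* unreachable: some stream matched best_urgency above */
}

/* Egresses up to `chunk` bytes from the stream; returns bytes sent. */
size_t egress_chunk(egress_stream *s, size_t chunk)
{
    assert(s != NULL);
    size_t sent = (s->pending_bytes < chunk) ? s->pending_bytes : chunk;
    s->pending_bytes -= sent;
    return sent;
}
```

With two equal-urgency API requests and one more urgent request, the urgent one is sent first and the other two then alternate chunk by chunk, so a request that started earlier cannot hold up a later navigation request in the same bucket.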

TCP Happy Eyeballs

We spent a considerable amount of time investigating regressions in app-start abandonment and ad impressions. We were unable to identify any network-level regressions, such as latency or failures, that could explain these regressions. After testing several hypotheses, we suspected that the culprit might be the absence of Happy Eyeballs. Happy Eyeballs was supported by the legacy stack, and implementing it in the new stack was in our backlog but not given a high priority. We implemented and experimented with Happy Eyeballs for TCP and QUIC, which addressed these and other regressions. In hindsight, we underestimated the impact of this feature. Specifically, we overlooked the consequences of relying on the long OS TCP connection timeouts when IPv6 is unavailable, as well as the prevalence of networks affected by IPv6 connectivity issues. Better telemetry also would have helped us track down the root cause of the problem more quickly.

We initially rolled out Happy Eyeballs for both QUIC and TCP. We then hypothesized that we could simplify the new stack by removing Happy Eyeballs for QUIC and keeping it only for TCP. This was because there was already a fallback mechanism in place that triggered a TCP connection attempt when QUIC took too long to connect, which we expected would also be triggered when IPv6 was unavailable. Additionally, we were aware that Chromium had implemented Happy Eyeballs only for TCP. After testing, we confirmed that this simplification could be made without any noticeable impact on users.
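For readers unfamiliar with the technique, the core of Happy Eyeballs (RFC 8305) is to avoid waiting for one address family to time out before trying the other: candidate addresses are interleaved by family, starting with IPv6, and connection attempts are staggered by a short delay, with the first successful connection winning. The sketch below only builds such an attempt schedule. It opens no sockets, it uses the 250 ms default delay recommended by RFC 8305 rather than whatever value the stack uses, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FAMILY_IPV6, FAMILY_IPV4 } addr_family;

typedef struct {
    addr_family family;
    const char *text; /* printable address, for illustration only */
} candidate_addr;

typedef struct {
    candidate_addr addr;
    unsigned start_delay_ms; /* when to start this attempt */
} connect_attempt;

/* Default "Connection Attempt Delay" recommended by RFC 8305. */
#define ATTEMPT_DELAY_MS 250u

/* Fills `out` (capacity >= n) with attempts that alternate between IPv6
 * and IPv4, preserving the resolver's order within each family.
 * Returns the number of attempts written, which equals n. */
size_t build_attempt_schedule(const candidate_addr *in, size_t n,
                              connect_attempt *out)
{
    assert((in != NULL && out != NULL) || n == 0);
    size_t next6 = 0, next4 = 0, count = 0;
    addr_family want = FAMILY_IPV6;
    while (count < n) {
        size_t *cursor = (want == FAMILY_IPV6) ? &next6 : &next4;
        while (*cursor < n && in[*cursor].family != want) {
            (*cursor)++; /* skip addresses of the other family */
        }
        if (*cursor < n) {
            out[count].addr = in[*cursor];
            out[count].start_delay_ms = (unsigned)count * ATTEMPT_DELAY_MS;
            count++;
            (*cursor)++;
        }
        want = (want == FAMILY_IPV6) ? FAMILY_IPV4 : FAMILY_IPV6;
    }
    return count;
}
```

On a network where IPv6 is advertised but broken, the IPv4 attempt in this schedule begins after 250 ms instead of after the much longer OS TCP connection timeout.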

Impact and looking ahead

After accumulating these fixes and many more, we have now launched the new stack to all of Instagram’s billions of users. The new stack is less than a fifth of the size of the legacy stack it replaces and has meaningfully improved app-performance and business metrics.

Going forward, we plan to roll out the new stack to all Meta mobile apps. In parallel, we are continuing to modernize by integrating the mvfst QUIC library.