Messaging at Scale 

It’s no secret that we have a lot of messaging experiences at Meta, from dedicated apps such as Messenger and WhatsApp, to feed-style apps such as Instagram and Facebook, to the full family including Facebook Lite and Instagram Lite, and even to Metaverse-focused experiences like the Meta Quest companion app.

Outsiders often assume that there is a single codebase powering messaging in all these places, but the truth is more complicated: While there have been many efforts over the years to increase convergence with limited success, the default—for reasons we’ll speak to below—is that each app primarily builds and iterates on an independent codebase.

In 2018, Messenger engineers initiated a project called “msys” with the goal of changing this status quo and finally building a foundational messaging layer that could be shared across the family of apps. In the last five years, this effort has seen both successes and challenges, reaffirming many benefits of sharing mobile-product code while also revealing unexpected obstacles that complicate the trade-off.

In this blog post, we walk through this journey, covering technical strategies, key results, and our most notable learnings.

Recent Mobile Messaging Code-Sharing History at Meta

Counting the number of messaging experiences at Meta requires not just looking at how many apps there are, but also multiplying that by the number of messaging surfaces (such as Android, iOS, macOS, WWW, and so on) found in each app.

Building, stabilizing, and maintaining shared features for each surface offends engineering common sense. Yet, it is the default for many larger applications at Meta, and there are some good reasons for this: Connected to Meta’s “Move Fast” culture, it optimizes for creativity by maintaining a low barrier to trying ideas. Decoupling also offers the advantages of independent development: Well-funded teams can progress on the specific critical features required for each surface in parallel.

Still, in many key cases, the freedom teams had to create their own implementations resulted in divergences with little value, causing complications for both the teams themselves and the infrastructure teams supporting them.

Consider Messenger iOS and Android, which for many years shared the same contract with the server but had completely different architectures for processing updates, largely due to having been developed by separate teams. 

For Messenger iOS, server payloads were held in memory and shown in the UI before being persisted to disk. This architecture delivered low sync latency and featured a simple, reliable pipeline, from receiving a message to updating the UI. However, the system also suffered from a persistent class of missing message bugs that would show up only on app restart.

In contrast, Messenger Android was built with server payloads persisted to disk before being rendered, thus ensuring consistency across app restarts, but at the cost of latency. This system also had its own issues, including some bad UI jank caused by the large number of steps between receiving a push and rendering it on screen.

The net result was that the team spent a lot of time and energy stabilizing two completely different sync solutions, while also shifting some runtime and maintenance cost to the server, which had to help investigate these issues.

Looking past our larger apps, some teams were finding ways to reuse the rich messaging-client codebase: Smaller apps, constrained by the size of their engineering teams, shared entire apps and surfaces. For example, Workplace Chat was a relinked Messenger application with a different icon and a few swapped-out components. This strategy enabled them to launch a polished product with a small team while keeping their app updated with the latest features. But it also meant that they included code for lots of features that were not available in their application—and their activity in the shared codebase caused a few Messenger incidents as well!

So, when it came to code sharing, neither the larger apps, which shared minimally, nor the smaller products, which shared maximally, seemed to be getting it quite right.

Addressing the Growing Burdens of Code Duplication

It’s possible to find ways to keep a system like this afloat, but the Messenger team saw an escalation in both the severity and quantity of challenges:

  • Private sharing, and specifically messaging, was one of the fastest growing areas of exploration at the company, and we were entering a phase where teams would be eager to experiment with new messaging content types, audiences, and use cases.
  • Recent implementations of ambitious, common social features across the family (e.g., Stories) had been slowed by the need to allocate dedicated engineering resources for each messaging surface.
  • Keeping consistency and baseline quality was challenging as on-calls battled different types of bugs on each surface and teams would often sprint and wind up with code that “works” before moving on to their next bet.
  • The complex server infrastructure required for messaging had the added responsibility of maintaining backward compatibility across all these surfaces, leading to considerable strain on that codebase.

We had conviction that centralizing the majority of messaging business logic into a single shared implementation would be the key to creating consistent, high-quality messaging experiences across the entire family of apps. Moreover, we believed that streamlining business logic in this manner was likely to reduce the time required to launch new features, as we’d only have to implement them once.

With this in mind, the Messenger team concluded that it was the right time to get serious about finally building a shared messaging layer for the company.

Designing the Platform

Leading up to this time, synchronization and consistency issues had been some of the most persistent and challenging problems in the Messaging space, so it stood to reason that a sync engine should be at the center of the platform, and that engine should be database first.

Similarly, scared by the fundamental divergences we’d seen between iOS and Android over the years, we focused on building a truly singular solution in cross-platform code.

Next came the trickier question of which layer to share at. In building a mobile platform, people intuitively start with an API, creating what we’ll call here a service: a system that can be interacted with via a series of functions (such as authenticateUser, sendTextMessage, sendPhotoMessage, blockContact, and so on).

To validate this idea, we looked closely at how messaging features are typically built and concluded that they tend to be highly interrelated. That is, when you’re building a feature, you end up interacting with lots of different messaging data and often affecting the behavior of other features.

To illustrate this, consider common messaging flows, and notice that they depend heavily on “side effects”: When a message is sent, we expect the inbox to be reordered, the message snippet for that chat to be updated, and the message to be bolded on the recipient’s side. This requires coordination across the two main screens (inbox and chat view) that all features participate in. Popular features, such as messages that expire after a certain time or user input, tweak this bumping behavior in different ways in different apps; for example, does an expired message ‘unbump’ the chat?

With all this in mind, we iterated to a model where msys would function as a headless messaging app that consolidated all feature implementations within it. This approach aimed for the msys layer itself to create view models that bind directly to the UI.

To achieve this, msys would need to provide a language and tools for engineers to write their messaging business logic directly within the platform, making msys more like Unity or the mobile OS itself than a typical mobile SDK. If done correctly, we hoped to generate performance wins while allowing all business logic to be shared across the iOS and Android platforms.

The msys Ecosystem

The msys ecosystem was developed and eventually shipped as the underpinning of our Messenger iOS rewrite, which launched in 2020. The system includes a powerful and flexible syncing engine at its core. On top of this, all features are written in a database-centric architecture built upon SQLite. In this setup, all feature state is expressed in relational tables managed by the sync system, which either populates them with remote data or pushes client updates back to the server. The UI operates by executing queries against these tables and rendering the view models they produce.
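
For example, the inbox screen can be driven by a query along these lines (a simplified, hypothetical sketch; the real msys schema, column names, and view-model shape are more involved):

  -- Hypothetical inbox query: each resulting row becomes an inbox view model.
  SELECT chat_id, title, snippet, is_unread, timestamp
  FROM chats
  ORDER BY timestamp DESC
  LIMIT 20;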

For this system, mutations to handle the “side effects” of sending a message are straightforward:

  1. Reordering chats after syncing a bunch of updates can be expressed as:

  UPDATE chats SET timestamp = now() WHERE chat_id = 1234;

  2. Marking a chat as unread can be expressed as:

  UPDATE chats SET is_unread = 1 WHERE chat_id = 1234;

  3. Updating the message snippet of a chat can be expressed as:

  UPDATE chats SET snippet = 'Message 4' WHERE chat_id = 1234;


So far, this may seem within the realm of the normal ORM systems that client engineers are familiar with, such as Core Data on iOS. But msys took this to an extreme, aiming to build the majority of the application in the data layer. While unusual, this approach optimized for sharing the maximum amount of logic across all surfaces.

When writing SQL by hand, we quickly ran into hurdles, finding that it lacks type safety, version management, and an ability to easily compose sequential statements unless embedded in another language. To address this, we introduced CG/SQL (CQL), a bespoke DSL (domain-specific language) that makes it easy to write SQL logic in an imperative, type-checked way and compiles it to dependency-free, performant C.

With CQL, the logic for receiving a message is expressed as:

create procedure receive_message(chat_id long int not null,
                                 message text not null,
                                 message_timestamp long int not null)
begin
  update chats set timestamp = message_timestamp where id = chat_id;
  update chats set is_unread = 1 where id = chat_id;
  update chats set snippet = message where id = chat_id;
end;
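
A sync handler or another CQL procedure can then invoke it with a plain call (illustrative arguments; in practice the procedure is reached through other CQL code or its generated C entry point):

  call receive_message(1234, 'Message 4', 1700000000);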

Once we had a lot of logic expressed in this system, we worked backward to create an SDK layer by building canonical implementations of key functions and then generating stubs into the language of choice for the given surface.

With every feature expressed relationally—and many bigger applications choosing to keep forked versions of core procedures—the amount of CQL and data logic grew rapidly and we reached hundreds of SQLite tables and thousands of CQL procedures across the Meta codebase.

Case Study: Bringing msys to Facebook

Messaging has long been an integral part of the Facebook app experience. In 2014, we split Messenger into its own standalone app and removed messaging from the Facebook app. By 2019, messaging had been reintroduced in Facebook for certain use cases. This was achieved with a completely custom implementation built on top of an FB-specific messaging data layer, fully distinct from the Messenger codebase.

Leveraging their own data layer gave FB developers flexibility and agility, but the implementation suffered in terms of reliability and struggled to keep up with the flow of new features in Messenger.

When the company pivoted to end-to-end encryption, the custom data layer became dramatically more expensive to maintain, and the clear choice was to bring msys into Facebook to power its UI on both mobile platforms. To make the experiment feasible, the Facebook team decided to push msys data into their existing ViewModel layer, allowing the UI to be abstracted from the data source powering it.

Ultimately, after two years of porting features, interpreting tests, and chasing down regressions, we were able to successfully ship msys in both the Facebook Android and iOS applications.

The journey of scaling msys to Facebook was a learning experience, and it was an early sign that msys wasn’t a perfectly adapted system for all use cases in the family of apps. Nevertheless, trade-offs exist for a reason, and in the end, the massive effort demonstrated some key performance improvements over the existing solution, with time-to-interact (TTI) for Stories and inbox showing roughly 25% improvements on average.

Equally importantly, this effort set the stage for UI-level consolidation. With Messenger iOS and Facebook iOS now sharing the same data layer, it became feasible to also share chat-view UI code. As part of the LightSpeed project in 2018, the Messenger chat view had been rewritten with modularity at its center. With this chat view, the core system was fairly isolated from features, so we were able to exclude or include them declaratively at build time, reducing bloat from Messenger features that weren’t shipped in Facebook and making it easy to create Facebook-specific integrations.

Since a lot of the hard work had already been done, porting this highly polished chat view into the Facebook app was much smoother than integrating msys itself. We completed the initial code in just six months, and this second effort delivered impressive results: a huge increase in topline engagement metrics for Facebook Messaging.

In many ways, this is a more refined version of the original strategy smaller apps were using in the times before msys: shipping quality experiences with high leverage by employing large amounts of shared messaging code.

What Went Wrong?

A key test for any platform is scaling to the first few customers. Almost immediately, we were surprised that even though msys was designed to have business logic written only once, many partner apps began creating their own CQL routines, even for common logic such as rendering a list of messages. The barrier to creating and maintaining their own multi-thousand-line queries should have scared them off, but partners showed that the need for flexibility and quick customization was dominant in many cases.

We also often needed to migrate fully matured and complex surfaces, such as the chat view or inbox, to the msys ecosystem in their entirety before conducting A/B testing. This worked against a company culture that prized performing migrations incrementally so each step could be vetted for app health and user engagement impact. This type of all-or-nothing approach significantly increased the challenge of unseating stable experiences that had undergone years of refinement.

Next, as we continued to add new schema artifacts like tables, columns, and indices, we ran into significant problems with the scalability of our SQLite database design. With the entirety of the Messenger feature set expressed in a single, monolithic SQLite database, each incremental schema change translated into more initialization time, more binary size, more complex relational dependencies, and increased tech debt from unused features. As such, we have continued to invest in workarounds, from modularizing our schema bundles to lazy-loading our schema artifacts, which have helped us mitigate some of these challenges.

Finding a Balance

Despite the challenges, we see substantial benefits from our consolidated codebase every day, with better consistency and efficiency than we had pre-msys. Thanks to msys, common sync issues—such as losing state after restart or missing key real-time pushes—are extremely uncommon. Likewise, shared per-platform chat views are key leverage multipliers for engineers who ship features such as Community Messaging and Generative AI across Facebook and Messenger.

However, sharing across OS platforms required creating a system with its own idioms. While good for codebase consolidation, it forced high-achieving mobile engineers to ramp onto a new ecosystem. This created a huge tax, and in the end, the added value of consolidated implementations of business logic was not always enough to overcome it. Mobile engineers at Meta move around between projects and teams a lot, so the cost of learning a new technology—from the boundaries of possibility to new syntax and tooling to new debugging techniques—adds up quickly. 

We’ve seen this tension manifest as an inherent conflict between sharing one implementation everywhere and shipping features quickly. Since its inception, msys has prioritized high-quality shared features over shipping quickly. Looking forward, we realize that some amount of fragmentation in the codebase is not just expected, but also desirable. Finding a balance requires knowing when to prioritize one over the other and ensuring we support divergence gracefully.

What’s Next?

In the spirit of finding this balance, msys will shift toward a core-and-extensions model: internalizing its key components while delegating feature extensibility to the application layer. It has been clear since its inception that, in the hands of an expert team, msys provides a solid base for (eventually) shipping quality messaging experiences. To take the next step as a platform, our goal is to double down on the parts of our strategy that have worked well for us, including our sync engine, our powerful SQLite-oriented toolset, and our progress in creating a UI platform. At the same time, we will work to boost developer agility by investing in several key areas:

  • Increasing compatibility with non-relational data

By allowing more features to opt out of relational representations, we can reduce schema growth and enable quicker iteration. We want to increase availability and usage of blobs (such as JSON) as a way to reduce our overall schema size and help engineers stay in the mobile stacks they love.
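
As a rough sketch of what this could look like (assuming SQLite’s built-in JSON1 functions; the messages table, extras column, and reaction_theme key are hypothetical), feature-specific state can live in a single JSON blob instead of new columns and indices:

  -- Hypothetical: keep loosely structured feature state in one JSON column
  -- rather than growing the shared relational schema.
  ALTER TABLE messages ADD COLUMN extras TEXT;

  UPDATE messages
  SET extras = json_set(COALESCE(extras, '{}'), '$.reaction_theme', 'celebration')
  WHERE message_id = 42;

  SELECT json_extract(extras, '$.reaction_theme') FROM messages WHERE message_id = 42;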

  • Moving to a federated storage model

In the same spirit as microservices, we are moving to a multi-database model that should address increased feature isolation. This model will also provide more control over what we bootstrap and when, reducing unintended impact on unrelated systems from each change. Clearer feature boundaries will enable better contracts between systems and increased concurrency across our platform.
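
One way to picture the direction (a sketch using SQLite’s standard ATTACH mechanism; the database and table names are hypothetical, and the actual msys approach may differ) is feature-owned databases sitting alongside the core one:

  -- Hypothetical: core chats live in the main database, while a feature keeps
  -- its own tables in a separate file that can be attached lazily.
  ATTACH DATABASE 'community_messaging.db' AS community;

  SELECT c.chat_id, c.snippet, cm.channel_name
  FROM chats AS c
  JOIN community.channels AS cm ON cm.chat_id = c.chat_id;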

  • Focusing on how to extend core messaging scenarios outside the data layer

As messaging apps at Meta mature and become more distinct, the need to support different advanced feature sets such as message expiration is becoming paramount. We plan to explore tooling that allows Meta engineers to customize this kind of behavior while directly re-using core queries (instead of forking them).
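
As a purely illustrative sketch (not the actual tooling), a shared core procedure could accept an app-supplied policy flag, so that behavior like whether expired messages unbump the chat is customized without forking the query:

  -- Hypothetical shared procedure: apps pass a policy flag instead of
  -- maintaining their own forked copy of the bump logic.
  create procedure bump_chat(arg_chat_id long int not null,
                             message_timestamp long int not null,
                             unbump_expired bool not null)
  begin
    if unbump_expired then
      -- Roll the chat back to its newest non-expired message.
      update chats
      set timestamp = (select max(timestamp) from messages
                       where chat_id = arg_chat_id and expired = 0)
      where id = arg_chat_id;
    else
      update chats set timestamp = message_timestamp where id = arg_chat_id;
    end if;
  end;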
