We will be hosting a talk about our work on FlightTracker: Social graph consistency at scale during our virtual Systems @Scale event at 11am PT on Wednesday, March 17th, followed by a live Q&A session. Please submit any questions you may have to firstname.lastname@example.org before the event.
Serving the social graph for billions of users places large demands on Facebook’s online serving stack. Dynamic, personalized experiences like News Feed rely on efficient and highly available data stores serving billions of reads and millions of writes per second.
Historically, TAO, a write-through cache on top of a sharded database, served the majority of this workload. It exposed an object and association (i.e., nodes and edges) data model and simple query APIs. When end users interact with the Facebook mobile app or website, requests reach web servers where the product logic lives. Each web request may perform hundreds to thousands of reads and many writes to TAO (Figure 1). TAO used to provide Read-Your-Writes (RYW) consistency by write-through caching–routing reads and writes for a single user to the same cluster. But this approach of relying on fixed communication paths grew increasingly problematic due to its reliability drawbacks.
Figure 1. User devices send web requests to Facebook web servers, each of which sends many read and write queries to TAO.
In addition, the social graph ecosystem expanded over time. Products moved from directly accessing TAO to issuing more complex read queries in an expressive query language, such as filtering, ranking across multiple edges, or edge list intersection. It is inefficient or even impossible for TAO alone to serve all the complex queries. We thus built global secondary indexing services to optimize for complex queries. The indexing systems leveraged asynchronous replication to achieve low latency, high availability. It did not, however, provide Read-Your-Writes consistency. The lack of RYW consistency made it challenging to adopt new indexing use cases or transparently optimize queries.
To overcome this challenge, we developed FlightTracker: a set of APIs and services which manage consistency across the social graph. FlightTracker provides uniform RYW guarantees for TAO and global secondary indexes. FlightTracker operates independently of data stores. It facilitates decoupling and simplification of services, while preserving efficiency, latency, and availability advantages. Let’s look at an example.
A hypothetical example
To understand the challenge we faced without uniform RYW consistency, consider a hypothetical graph consisting of USER nodes and SONG nodes. Songs may have attributes like genre or title. Users may have associations to songs (like <USER Luna, ENJOYS, Song D>).
Figure 2. The subgraph of a hypothetical application.
A hypothetical application may wish to filter songs in a given genre that a user’s friends enjoy. Or, show which songs a given user enjoys for a single genre. For a small list of songs, one can answer this latter query naively by first fetching from TAO all ENJOYS associations and their attached SONG objects and then filtering based on the desired attributes. For larger song lists spread among many database shards, a dedicated global indexing service can answer such queries far more efficiently.
Care must be taken when offloading queries to separate indexing services if they do not provide the same consistency semantics as TAO. Consider a user, Luna, who recently marked that she enjoys Song 200. Querying the song list from TAO would ensure that the result reflects this recent write. An index without Read-Your-Writes guarantees may be stale, and omit the newly enjoyed song when queried. This can lead to confusing behavior for developers and users who observe stale results. What’s more, it limits our ability to transparently optimize queries.
FlightTracker solves the potential stale read problem by tracking metadata of recent, in-flight writes. When reading from a data store, this metadata specifies fresh results to be retrieved or included. We refer to a set of write metadata as a Ticket and implement a Ticket-inclusive read API for each data store: a contract that ensures a read reflects the effect of the writes in the Ticket. Combined, these primitives provide uniform Read-Your-Writes consistency across any backend that can provide a Ticket-inclusive read API.
Astute readers may wonder why we picked RYW as default instead of choosing something stronger or what makes FlightTracker novel and interesting. Before we dive into the details, we’ll first share some of our reasoning and philosophy.
Why Read-Your-Writes as default?
Read-Your-Writes consistency serves as an intuitive default for both application developers and end users. Data stores that do not guarantee an application to see its own writes are difficult to develop against. Similarly, not guaranteeing a user would see his or her own writes is difficult to use for interactive applications. As a result, we want to extend the boundary of RYW to end users. No matter which device an end user chooses to interact with Facebook. No matter how many web requests a user triggers. No matter from which data store the web requests query the social graph.
Stronger consistency models* greatly simplify development and usage, but also constrain implementation. For our read-optimized workload, the ability to serve most queries locally even with local replicas a few seconds stale is essential. In contrast, RYW allows us to utilize a stale replica by adding fresh versions of only a limited set of writes (“your” writes). Since applications are used to concurrent writes, we also always have the option to return a newer version of the data, i.e., providing at-or-after semantics. These two properties combined gave us great flexibility in our implementation.
Our goal is thus not to provide the strongest consistency possible for every application and all social graph data. Rather, we aim to provide a reasonable default and options to obtain additional consistency guarantees for select applications.
*Popular examples include linearizability (for which all writes missing from a stale replica need to be visible before serving a read) or causal consistency (for which most writes need to be visible).
Why managing consistency is challenging for a social networking workload
Providing consistency guarantees for read-optimized systems while maintaining low latency and good cacheability boils down to implementing a staleness check to assess if a cache or replica can serve a read query with its local data. This staleness check must be: (1) Local, avoiding network communication in most cases. (2) Highly granular, so that few queries generate extra work due to false positives from the checks. And (3) conducive to incremental repair, enabling the extra work to find fresh data to be reused for subsequent queries. Importantly, even in systems using synchronous quorum writes, single-replica reads need local staleness checks.* As we will see, this can be straightforward to implement for caches like TAO, yet far more complicated for global secondary indexes.
*For example, Raft followers or Paxos acceptors might have no knowledge of a write committed by the leader if they were not part of the commit quorum. Logical and physical timestamps, such as Hybrid logical clocks or Spanner’s TrueTime, provide a simple and scalable way to determine staleness—the local data is sufficiently fresh if timestamped higher than the desired read. Unfortunately, neither approach is granular or conducive to incremental repair. If the local store falls ten seconds behind the desired read timestamp, for example, it cannot service any queries until it has processed all of the missing writes.
Data store heterogeneity
Over the years, our social graph ecosystem expanded to different data stores, for example, our indexing systems, or ZippyDB for write-optimized sub-graphs. By design, these data stores possess different strengths and leverage different internal protocols. Loosely-coupled (if at all), they often rely on asynchronous replication for availability and latency benefits. We prioritize retaining these benefits while providing uniform consistency across these data stores, especially since loose coupling also promotes organizational scalability and flexibility.
Specifically for RYW, our desire to implement user-centric guarantees means that we experience intra-session concurrency at several levels. A single web request issues TAO reads and writes in parallel. A single browser or mobile app can issue many web requests in flight at once. A user may even be accessing Facebook simultaneously from multiple devices.
FlightTracker observes that application developers do not expect concurrent web requests to communicate with each other. A common mental model is that concurrent requests execute in a random order and possibly interleave with each other. Thus, visibility from one request to the next is only assured if one request finishes before the other starts. This observation led us to the following relaxation of the RYW guarantee, providing much-needed flexibility for handling intra-session concurrency.
A read to the social graph observes all writes done by the same end user within the same web request or in previously completed web requests.
To address the above challenges, FlightTracker decomposes the consistency problem into (1) collecting writes that need to be made visible in the form of a Ticket, (2) storing per-user recent write metadata in a dedicated service, and (3) implementing Ticket-inclusive reads for individual data stores. Let’s take a closer look.
Tickets encapsulate recent write metadata. Minted by each data store, they contain key and versioning information needed to coordinate consistency. Consider a write which updates a single SONG object to change an attribute. TAO may produce a Ticket that contains the ID and latest version of that object.
To use this Ticket, subsequent reads to this object would send this Ticket along the read query. We call this type of query a Ticket-inclusive read. For example, consider a stale TAO cache replica that only has version 1 of this object. To serve this read, the replica determines its cached value is stale, and fetches the updated value from an upstream database. We refer to this process as a consistency miss. It functions like a regular cache miss but only to ensure consistency. Subsequent Ticket-inclusive reads to the same version can return the newly cached value, i.e., the result becomes cacheable. We found the term “consistency miss” helpful for system monitoring and accounting, general communication, and building the mental models of the engineers on our team as well as on the data store teams.
Extensible and data store-specific, the internal representation of a Ticket can utilize any available versioning primitive–as long as the version is comparable and increasing. In production, we typically encode Tickets with key information, versioning primitives, and their schema type (such as USER or SONG).
Our read-heavy workload means that Tickets need to be small. Notably, the flexibility in representation allows us to make tradeoffs between Ticket size and efficiency of a Ticket-based read. For example, encoding Tickets with a per-shard sequence number 100 allows us to merge multiple writes on a single shard into one entry, making the Ticket more compact. At read time, this comes at the cost of lower selectivity: the per-shard Ticket semantically represents a set of all writes 0..100 on that shard. Thus, a higher chance exists that the local replica is considered stale.
Tickets hide these details from application developers and present an opaque token interface, offering basic functions such as merging (i.e., set union of write metadata), containment checks, and serialization. Most commonly, applications simply merge Tickets as they do writes. They then include this merged Ticket on subsequent reads. This flexibility allowed us to extend Tickets to cover other data stores like ZippyDB and evolve existing systems like TAO to new primitives without changing the API.
The FlightTracker service
Assuming data stores implement a Ticket-inclusive read API, providing read-your-writes consistency amounts to supplying the “proper” Ticket at read time. Since we want to extend the RYW boundary to end users, the Ticket needs to contain recent writes done by an end user. This Ticket can be attached to all subsequent reads to ensure the user observes their recent writes.
Intra-session concurrency renders accumulating a user’s writes locally in a single web request insufficient. FlightTracker solves this problem by providing external, per-user storage of Tickets. It tracks in-flight writes for every user. The social graph client library sends metadata of recent writes to FlightTracker along with the user ID. At the start of a web request, a merged Ticket gets fetched from FlightTracker by user ID (and merged with subsequent writes during that request). This guarantees visibility of writes done in previous web requests. FlightTracker thus functions like a simple key-value store, with user IDs as the key, and Tickets as the value.
We deploy FlightTracker as an in-memory, replicated service. Since our asynchronous update pipelines have a max latency of around 60 seconds, FlightTracker need only store the last 60 seconds worth of writes, limiting the working set dramatically. Empirically, a replication factor of 3 provides sufficient redundancy for the small working set, as long as those replicas are placed in independent failure domains. FlightTracker relies on ShardManager (partitioning the user ID space into shards) for intelligent placement and load balancing.
A simple single-round quorum protocol drives replication for FlightTracker. The client library sends writes to all replicas and acknowledges success if a majority responds. Reads merge the Tickets from a majority of replicas. Notably, the FlightTracker service does not require strong consensus or even a two-round protocol. We can merge Tickets in any order and can safely include additional write metadata in the returned Ticket without violating the overall RYW consistency.*
*For a nuanced description of FlightTracker replication and explanation, please see Section 5.1 of the FlightTracker paper.
The FlightTracker service identifies the set of recent writes a read might care about. For Read-Your-Writes, this set is the recent writes done by a given end user (our definition of “your”). Each data store takes on responsibility for serving Ticket-inclusive reads, ensuring the read result includes all writes in the attached Ticket.
Handling these Ticket-inclusive reads typically consists of three steps: (1) filtering for relevance, (2) checking for inclusion, (3) and handling local staleness. The first step filters out writes irrelevant to the read. For example, if the read asks TAO for SONG 1005, only writes which updated the object 1005 could affect the result. Thus all other writes in the Ticket can be ignored for this read. We refer to this process as “cropping.” The client library typically crops the Ticket before performing a read. In production, after cropping, a majority of Tickets are empty, suggesting that most reads do not correspond items the web request just wrote to.
However, given our read volume of 1016 queries per day, even the rest is a non-trivial amount.
The next step in the process verifies whether the local data store includes the relevant writes. This means assessing whether the local data store is stale relative to the read. For a simple cache like TAO, we can directly use the versions in the cropped Ticket to assess if the cache is stale. If so, TAO can proceed to the final step: fixing the result by issuing a Ticket-inclusive read upstream to other caches or the database, and caching these updated items.
These steps, however, present more challenges for global secondary indexes. Our indexing services get asynchronously updated by pipelines that reshard, transform, and filter (Figure 3). As a result, the relevance of a write to an index read query is often ambiguous. The updates do not always reach the index read servers, making it difficult to determine whether a local index replica includes a particular write. Furthermore, indexes often become too expensive to directly query from the database every time an index server is deemed stale. This means resorting to alternative strategies to handle local staleness. We address these challenges with the following combination of techniques:
Figure 3. Implementing Ticket-inclusive reads is more complicated for global secondary indexes as their update pipelines filter, transform, and reshard updates.
Client-side read repair
Some index queries return results with sufficient information such that we can patch things up on the client side. Consider the simple social graph like in Figure 2. Luna recently indicated that she enjoys Song 200 in the pop genre and Song 789 in the jazz genre. Luna’s recent write Ticket in FlightTracker would be the following:
id1: 42, // Luna’s user id
id1: 42, // Luna’s user id
version: 8 // Luna changes her mind about this song a lot 🙂
… // metadata for writes in other databases
Suppose the application uses an index query to generate the list of songs Luna enjoys in the pop genre. The initial index read result could be [Song 100, Song 123]. Song 200 is not part of the list, possibly because the update of the edge <Luna, ENJOYS, Song 200> has not replicated to the indexing server yet. In this case, the index client library first crops the Ticket down to write metadata of ENJOYS edges since all other types of writes are irrelevant. For each of the ENJOYS edges, it then reads from TAO the id2 of the edge to assess whether the genre matches “pop.” Seeing that Song 200 is indeed in the pop genre, we add it to the final list of songs [100, 123, 200]. And voila! We repaired the index read result.
Some queries cannot be easily read repaired, especially on the client side. This is because the results do not contain complete information. Consider a query which returns the count of songs enjoyed by a given user for each genre. Even if we know a recent write has occurred, we can’t accurately merge it in with an index read result. In these cases, we depend on the index itself to yield the correct response. This means we may have to wait for the updates to be fully replicated. To do so, we’ve built another service in the FlightTracker ecosystem named Arrivals. It tracks writes as they land in various indexing services.*
*Arrivals is referred to as FlightTracker-ReverseIndex in the OSDI paper.
Arrivals, essentially a reverse index, tracks a given write to its current replication status. The various stages of the index update pipelines inform Arrivals as writes are received, resharded, filtered, transformed, and finally applied (landed). On the read side, Arrivals functions as an oracle, answering whether a given write:
- Definitely does not affect a query. For example, if a user queries counts of songs they enjoy, then updates to the songs they do not enjoy cannot affect their query.
- May affect a query, but the write has not been applied to the index server yet.
- May affect a query and has been applied by the index server (has landed).
With this functionality, the social graph client library calls Arrivals to determine if an index query is stale by querying for the writes’ status in the Ticket. If an index result may be stale, the client can simply wait until the update is applied to achieve read-your-writes (RYW) consistency.
Built on similar infrastructure as FlightTracker, Arrivals is deployed as a replicated, in-memory service using ShardManager. Reusing the same building blocks allowed us to expand coverage into more domains.
So far we have been discussing RYW consistency, which is a good default but insufficient for every application. However, FlightTracker’s decomposition of the read consistency problem into identifying recent writes and serving fresh results allows us to flexibly offer varying levels of consistency. For example, to obtain a new type of consistency guarantee, we can track recent writes in different ways without changing Ticket-inclusive read implementation or any data store properties. Redefining the “session” — or what writes a “session” entails — enables us to progress beyond Read-Your-Writes and to provide a more generic session-level consistency.
Consider a pub-sub system like GraphQL Subscriptions which underlies Facebook’s notification systems. A user action triggers a published event to all subscribers. To render personalized notifications, the application may read data from TAO on the subscriber-side before delivering this notification. However, these subscribers may connect to different regions than the publisher. Thus, they could experience stale reads if event delivery outraces TAO replication. As shown in Figure 4, we solve this problem by passing the publisher’s recent writes Ticket along the event-delivery path and using Ticket-inclusive reads in the subscriber queries. This effectively provides read-the-publisher’s-writes on top of read-your-own-writes.
Figure 4. Reusing Tickets and Ticket-inclusive reads, we provide alternative write visibility guarantees such as read-the-publisher’s-writes for our pub-sub system.
As another example, we applied similar strategies to other frameworks inside of Facebook like guaranteeing async jobs see writes that occurred before they were scheduled. Importantly, the only work necessary is to transfer Tickets at the appropriate points in the communication path. Ticket-inclusive reads implementation remains the same for every supported backend.
FlightTracker has been in production for over four years. It manages RYW consistency for two database technologies, two cache types (including TAO), and three indexing systems. In aggregate, FlightTracker manages RYW for tens of millions of social graph writes per second, hundreds of millions of index reads per second, and billions of TAO queries per second. The FlightTracker service itself handles more than 100 million Ticket reads per second and 20 million Ticket writes per second.
The Social Graph serving stack sits on the critical path for serving Facebook users, so minimizing efficiency and reliability impact are crucial when rolling out FlightTracker and Tickets. As a regional, in-memory service, FlightTracker maintains more than 99.9999 percent read availability and an order of magnitude higher write availability than underlying databases. Queries and updates to both FlightTracker and Arrivals typically take less than 1 millisecond. The aggregate of Ticket-, FlightTracker-, or Arrivals-related code takes less than 1 percent CPU and less than 2.6 percent RAM.
FlightTracker freed TAO of dependence on fixed communication topology to provide RYW. This created an opportunity to improve efficiency by splitting the write path off from TAO and focusing on simplifying the read path. It also enabled TAO to have a more flexible failover pattern for greater reliability. With our read-heavy workload, 0.2 percent TAO reads have a non-empty Ticket attached. And only 3 percent of those reads pertain to updates not yet replicated via the per-shard replication stream. This keeps consistency miss rate manageable, limiting the overall cache impact of using Ticket-based reads.
Operating an end-user-centric service presented one of the most challenging aspects of initial implementation. The wide variety of platforms and APIs accessing social graph data made uniformly identifying a “logged in user” surprisingly difficult. Applications typically display good load spread among storage shards, but batch processes may operate per-user and put significant strain on single FlightTracker shards. Over time, we discovered that the most demanding use-cases need consistency guarantees the least. They typically write a large amount of data as part of a batch or migration and do not read it afterwards.
By establishing the Ticket-inclusive read contract, we uncovered consistency loopholes in underlying data stores and found low-probability bugs that cause permanent inconsistencies in TAO, graph indexes, and even database replication. These bugs range from protocol flaws, incorrect handling of error conditions, to relying on new data invariants not honored by all historical data. Previously difficult to notice, these bugs were outnumbered by transient inconsistencies. Ticket-inclusive reads should never return data older than what’s in the Ticket. So with FlightTracker, even a single occurrence of a stale result is actionable.
Although some of our motivations and implementations are Facebook-specific, our motivation to provide user-centric sessions is widely shared (such as Twitter’s Manhattan, or PathStore), as is our interest in extending consistency guarantees to global secondary indexes (such as Couchbase, Amazon DynamoDB).
The FlightTracker approach holds special appeal for read-optimized environments with caches. Even if underlying data stores provided very strong consistency, adding caches (as additional systems or on the client side) would still surface the perennial challenges of cacheability and invalidation.
Our FlightTracker approach is generalizable. Designed for heterogeneous data stores, Tickets can easily be extended to other data stores with minimal overhead. Implementing Ticket-inclusive reads and exposing versioning primitives do not require core replication protocol changes for the data store. They also have relatively small overhead. The client library, where a lot of the FlightTracker logic lives, can be implemented and rolled out gradually. Especially beneficial when trying to retrofit indexing systems, our approach allows us to separate the reverse metadata index into its own component.
FlightTracker is our approach for providing RYW consistency for Facebook’s social graph. It operates in a read-optimized ecosystem of asynchronously replicated caches, database replicas, and indexes. FlightTracker preserves the read efficiency, hot spot tolerance, and loose coupling benefits of eventual consistency. Additionally, it has enabled us to circumvent the scaling challenges encountered when using write-through caching for consistency.
For more details, please refer to our paper in OSDI’20.
We thank the following people for critical contributions to design and implementation of FlightTracker as well as the other systems in this blog post: Nathan Bronson, Kevin Doherty, Jinyu Han, Dmitri Petrov, Jim Carrig, John Hugg; Tina Park, Kevin Ventullo, Christopher Small, Andrew Bass, Tushar Pankaj, David Goode, Soham Shah, Lu Pan, Gordon (Zhuo) Huang, Brendan Forsyth, Neil Wheaton, Shilpa Lawande, and Tony Savor.