Within Meta’s private cloud, efficient distribution of large, hot content to hosts is an increasingly important requirement. Executables, code artifacts, AI models, and search indexes represent commonly distributed content types that help enable our software systems.
Three dimensions capture the scope of the task:
- Fanout: the same content may be read by anywhere from a handful of clients to millions of processes running in data centers around the globe.
- Size: objects to be distributed range from around 1 MB to a few TBs.
- Hotness: clients may read an object within a few seconds of each other, or their reads may be spread over hours.
Distribution requirements are exacting. First, content distribution must be fast: the predictive value of AI models decreases over time. Slow executable delivery increases downtime and delays deploying fixes. We strive to provide data at a rate bounded by either available network bandwidth of the reading host, or by the available write bandwidth of its storage media.
Second, content distribution must be efficient. One dimension of efficiency is scalability, i.e., the number of clients that can have their distribution needs met by a given number of servers. Another dimension involves network usage. We measure this both in terms of bytes transmitted and communication locality (e.g., an in-rack data transfer costs less than a cross-region transfer). A final dimension of efficiency includes resource usage on client machines; e.g., CPU cycles, memory, and disk I/O. Not only should we use as few resources as possible, we should also be able to adjust for their relative importance on different clients. Some services are memory-constrained, while others cannot afford to write to disk.
Finally, content distribution must be reliable. We measure reliability as the percentage of download requests that the distribution system satisfies within a latency SLA. Operational ease-of-management is another facet that often gets overlooked as a prerequisite for high reliability. In a production environment, workloads may change. Additionally, dependency SLA misses, including partial outages or performance faults, occur with some frequency. In order to maintain a high SLA for distribution during such events, engineers must be alerted quickly and have a clear picture of operational health for each client. Finally, when reliability, speed or efficiency start to degrade, they need simple knobs that adjust behavior, to quickly restore operational health.
Prior to our work, Meta had at least three different systems for large content distribution that had grown organically over time to cater to this problem in separate domains – executable distribution, model distribution and index distribution. However, no prior solution met all of the above requirements. We identified two root causes:
- no prior system struck the correct balance between decentralization and centralization.
- no prior system had sufficient flexibility to meet the requirements of the varied service types at Meta that require content distribution.
The first implementation provided highly-centralized distribution via hierarchical caching. Clients downloaded content from first-level caches on remote hosts. These caches, in turn, handled cache misses by reading from other caches, with the final layer of the hierarchy being a distributed storage system. Hierarchical caching was inefficient for hot content distribution, and had difficulty scaling. The cache hierarchy needed a large number of dedicated hosts, with the number of hosts increasing to keep pace with growth in workloads from services consuming data and the number of reading clients. Load spikes caused by hot content presented a continual problem: strict quotas were necessary to protect the centralized caches. However, readers of hot content frequently got throttled due to the bursty request pattern. In general, provisioning for transient spikes caused by hot content and setting quotas appropriately was quite challenging.
Meta also used two highly-decentralized systems: a location-aware BitTorrent implementation, and a static peer-to-peer distribution tree based on consistent hashing. In both cases, a peer is any process that wishes to download data and millions of such processes exist at Meta. The decentralized systems scaled much better than hierarchical caching. But they brought their own problems. First, because each peer made distribution decisions based on local information, resource efficiency and tail latency could be poor. With each peer making independent caching decisions, the collection of peers could retain either more or less copies of a data object than necessary. More importantly, operating these decentralized solutions was difficult. Engineers struggled to get a clear picture of health and status without aggregating data from large numbers of peers, given that each peer had a different and limited view of the state of distribution. In general, reasoning about system-wide correctness or efficiency proved very hard to do.
In summary, highly-decentralized systems turned out to be inefficient and difficult to operate, while highly-centralized systems scaled poorly. As a result, we chose to create a new, split design with a decentralized data plane and a centralized control plane. The decentralized data plane streams data from sources to clients via a distribution tree built by the centralized control plane. However, these trees are ephemeral and per-data-chunk. Each edge in a tree persists only while a chunk is being transferred from a source to a peer.
The design realizes a mechanism-policy split. Peers are simple and provide the mechanism for caching and transferring data chunks. The centralized control plane makes fine grained policy decisions about distribution. This includes identifying sources from where peers should get each chunk of content, when and how they should cache fetched content, and how they should retry failed downloads. The control plane consists of a handful of trackers (borrowing terminology from BitTorrent). They have a complete picture of the distribution state, what data each peer is downloading, and a list of chunks in each peer’s cache. This fine grained state enables trackers to make optimal decisions about data placement and distribution that minimize network hops and maximize cache hit rate. Centralizing the control plane has also made distribution easy to operate and debug: engineers can understand what decisions led to low availability, high latency, or poor cache hit rate because a tracker with a consistent view of the distribution state makes these decisions.
The second major problem faced by prior systems was a lack of flexibility. At Meta, clients have vastly different resources to spare for distribution. Some clients can dedicate gigabytes of memory or disk for peer-to-peer caching. Others have no resources to spare. Clients have very different access patterns and scale. Finally, the objectives for distribution can differ. Some clients need low latency, while others wish to reduce the load on external storage to avoid throttling. Each of the previous solutions was customized for a subset of the use cases optimizing for its client’s needs. To unify the disparate distribution solutions, we could not simply provide a one-size fits all solution. It was critical that we did not regress any client’s key metrics.
We therefore chose to make customization a first-class design priority. Trackers implement modular interfaces for specifying different policies for caching and fetching data. Further, each policy is itself configurable to allow for different tradeoffs across client types and responses to changing workloads. We use trace-driven emulation to search through the space of possible customizations. Doing so lets us find the best policies and configurations for each observed workload.
Owl: An Overview
Our solution, Owl, is a highly-customizable data distribution system with a centralized control plane and a decentralized data plane. Owl has been in production at Meta for 2 years and has scaled out rapidly (production traffic increased by almost 200x in 2021). Currently, Owl has over 10 million unique clients (binaries concurrently using the library), and it downloads up to 800 petabytes of data per day. Owl supports 106 unique types of clients and has customized policies for 55 of these.
Owl includes two basic components:
- Peer libraries: libraries linked into every binary that uses Owl to download data. If a binary is using Owl, we refer to it as a peer.
- Trackers: dedicated Owl services that manage the control plane for a group of peers.
Owl peer library provides a simple API for downloading data. Client processes fetch content from a source object by specifying a range of data to read and a unique identifier for the object being read. The caller can optionally specify a deadline and classes that check data integrity or decrypt provided data.
These peers can cache data in memory and/or on disk. The caches may be shared with the client binary if the client does not modify downloaded data. Owl uses the caches to serve content requests from other peers. Owl policies usually prefer to fetch data from a peer rather than an external data source. As a result, peer-to-peer distribution satisfies most requests.
A tracker manages download state for a set of peers. Typically, peers and trackers are grouped by region (a region is several co-located data centers), with 3-4 trackers per region providing scale and redundancy. Trackers are multi-tenant. In general, each tracker supports all Owl buckets. However, we use a separate set of trackers for binary distribution to provide strict performance isolation for this sensitive workload.
Trackers associate data and peers. For each chunk, tracker metadata specifies which peers are caching the chunk and which are downloading it. Tracker metadata also specifies the source of each peer’s download e.g., an external source or another peer. For each peer, the tracker metadata specifies the peer’s location (host, rack, region, etc.) and its cache state (the chunks in the cache, last access time, and so on). In contrast to highly-decentralized systems like BitTorrent, Owl trackers can maintain such detailed up-to-date state because trackers make all major decisions about caching and downloading chunks on behalf of peers.
Peers associate with one tracker. Each peer picks a random instance from the set of available trackers and registers by sending an RPC. Peers register with a new random tracker if their association with the current tracker fails.
A primary design principle in Owl entails keeping peers as simple as possible.This is achieved via a mechanism-policy split. The peers provide the mechanism to perform simple actions. These include downloading a chunk from a given source, caching or evicting a chunk from cache, or providing cached data in response to a request from another peer. To download content, peers ask trackers to decide from where they should fetch content, how they should retry failed downloads, and even what chunks they should cache locally.
This design principle has proven invaluable for operational simplicity. The team can change configuration values on trackers within seconds if necessary, and roll out new policies regularly without updating any of the peers. We associate each peer with a bucket that uniquely identifies the type of the client binary with which the library is linked. The bucket provides a way to customize Owl behavior for each type of client. It also lets us monitor usage, performance and reliability for each Owl customer individually.
Our paper that recently appeared at OSDI ‘22 describes the design and implementation of Owl in more detail. It discusses the challenges we faced in scaling Owl to handle workloads at Meta, and how we use Owl’s flexibility to customize its behavior for each type of client. Finally, it relates some additional challenges we faced during deployment and the insights that allowed us to overcome those challenges.