Scaling Releases: Inside Meta WWW Release Operations

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems at massive scale. Whether it’s novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges.


Vladimirs Kotovs and Casey McGinty

As Meta continues to grow and evolve, one thing remains constant: the importance of reliable and frequent releases of company web products. We will briefly explore the WWW continuous release, its challenges, and the core components and principles that make up the sustainable release process.

WWW is the company’s oldest source-code monorepo. It contains the code that runs on our Web servers and includes a large amount of business logic and front-end code. 

Over the years, the WWW release processes at Meta have grown organically, adopting many industry-proven techniques without adhering rigidly to any single one, so that developers can get their code out quickly and reliably in safe, small, and incremental steps.

The last time we publicly shared details about the WWW release was in 2017. At that time, we had just introduced the continuous WWW release approach, and the release has remained continuous ever since. It has stood the test of time: this pragmatic approach has allowed us to release our web products successfully on rapid schedules over the last few years.

Why are we fast?

Meta has always been about shipping; reliable and frequent releases have remained a competitive advantage. We release WWW frequently for several reasons.

  • Faster release cycles let our products respond swiftly to market changes and user demands, maintaining a competitive edge.
  • Developers see their work go live sooner, which is motivating and increases productivity.
  • A frequent cadence distributes changes across smaller, more reliable deployments.

Faster releases push us to move faster as an organization overall. The bar is kept high, and everyone in the company needs to take responsibility.

How fast are we?

After switching to continuous WWW releases, we went from three to ten releases a day and kept a similar cadence in 2024.

How long do our engineers wait for their code changes to get out? We track this as the Change (or Diff) Release Time metric: the elapsed time from when a change is committed to when the release containing it reaches production. The average is 4.5 hours, and regressions in this metric slow developer productivity and delay the mitigation of other regressions in production.
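To make the metric concrete, here is a minimal sketch of how such a value could be computed from commit and release timestamps. This is an illustration only; the data model and field names are assumptions, not Meta’s internal implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Change:
    commit_time: datetime    # when the change was committed to trunk
    release_time: datetime   # when the release containing it reached production

def change_release_time(changes: list[Change]) -> timedelta:
    """Average elapsed time from commit to production release."""
    deltas = [c.release_time - c.commit_time for c in changes]
    return sum(deltas, timedelta()) / len(deltas)

# Example: two changes released 4 and 5 hours after commit -> average of 4.5 hours.
t0 = datetime(2024, 6, 1, 12, 0)
sample = [
    Change(commit_time=t0, release_time=t0 + timedelta(hours=4)),
    Change(commit_time=t0, release_time=t0 + timedelta(hours=5)),
]
print(change_release_time(sample))  # 4:30:00
```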

WWW continuous-release cadence and change-release time have generally remained unchanged since switching to continuous releases.

Challenges faced by WWW

WWW has grown rapidly over the years. First, the codebase itself keeps growing: the number of lines of code constantly increases due to ongoing development, which introduces multiple challenges for release operations. For example, a larger codebase means a greater volume of tests to execute during each release cycle and, consequently, more test failures that must be addressed. The growth has also degraded the performance of our tooling, creating further inefficiencies.

Second, the number of changes per release grew as the number of developers working on the WWW codebase increased. As a result, a typical release in 2024 contains several times more changes than it did seven years ago. That also brought more build breakages, land conflicts, and other obstacles that scale with the number of engineers working in the codebase.

Finally, the number of products in WWW and the complexity of the dependencies among them have grown significantly over the years, putting additional pressure on operational scalability and increasing the workload on release operations.

WWW continuous-release principles

In the past, we released the Facebook front end three times a day using a simple release-branch strategy. Engineers would request cherry-picks (changes that had passed a series of automated tests) to be pulled from the main branch into the release branch for one of the daily releases. The branch/cherry-pick model was reaching its limits, so we moved to a continuous system in 2017. The changes released in this delivery mode are generally small and incremental, and each release is rolled out to production in a tiered fashion over a few hours.

To understand how the WWW continuous-release platform operates, let’s first look at the main principles we use to safely deploy thousands of changes per day. 

Shifting signals to early development stages

We shift health signals as early as possible to improve our trunk health. Newly drafted commits trigger multiple layers of health checks to help with validation. A single change may pass thousands of targeted tests before acceptance, and developers can perform additional testing with manual inspection and by running short-term canaries. Once committed, there’s an additional integration-testing phase.
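The systems involved are internal to Meta, but the shape of the flow can be sketched as a sequence of validation layers that a draft change must pass before acceptance. The stage names and runner below are illustrative placeholders, not the actual pipeline.

```python
from typing import Callable, Iterable

# Each stage returns True on success. Stage names and ordering are illustrative,
# not Meta's actual validation pipeline.
Stage = Callable[[str], bool]

def validate_diff(diff_id: str, stages: Iterable[tuple[str, Stage]]) -> bool:
    """Run pre-commit validation layers in order, stopping at the first failure."""
    for name, run in stages:
        if not run(diff_id):
            print(f"{diff_id}: failed at stage '{name}'")
            return False
    print(f"{diff_id}: accepted")
    return True

stages = [
    ("lint_and_typecheck", lambda d: True),  # cheap static checks on the draft
    ("targeted_tests",     lambda d: True),  # tests selected for the files touched
    ("short_term_canary",  lambda d: True),  # optional canary run by the author
]
validate_diff("D123456", stages)
```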

Curated reliable metrics

Release-time validation is highly curated to focus on detecting critical failures, which keeps test execution efficient. The goal is not for every release to be perfect, but to ship incremental improvements that are good enough until the next release follows a few hours later.

Efficient gradual exposure

We rely on multiple deploy phases to limit the overall impact on users. In the early phases we can recover faster and prevent, or vastly limit, external exposure. The number of phases and the time spent in each are minimized to increase overall speed. Developers can also independently roll out targeted features post-deployment with an internal control platform called Gatekeeper.
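Gatekeeper itself is internal, but the underlying idea, decoupling code deployment from feature exposure, can be sketched as a deterministic percentage-based gate check. The API and gate names below are hypothetical.

```python
import hashlib

# Hypothetical gate configuration: feature name -> percentage of users exposed.
GATES = {"new_composer": 5}  # expose to 5 percent of users

def gate_passes(feature: str, user_id: int) -> bool:
    """Deterministically bucket a user and compare against the rollout percentage."""
    pct = GATES.get(feature, 0)
    bucket = int(hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < pct

# The deployed code carries both paths; exposure is controlled by the gate, so a
# problematic feature can be turned off without waiting for another release.
if gate_passes("new_composer", user_id=42):
    print("serve the new experience")
else:
    print("serve the existing experience")
```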

High bar for reverting

Late errors in the release process are evaluated for risk and severity. Product teams understand this and protect changes with feature flags to limit the effect. Still, when mitigations are unsuccessful, getting a fix into the next release should take only a few hours. If we stopped for every regression, it would block multiple high-priority fixes and delay other changes waiting to reach users.

WWW continuous-release overview

Figure 1. Phases of WWW continuous-release process

WWW deploys continuously, 24/7, including weekends and holidays. We perform trunk-based deployments; there are no special branches, and no changes are excluded from the next release.

There are several main phases:

  • Build/Test
  • Three deployment phases, c1/c2/c3, which correspond to alpha, beta, and production quality.

The phases operate independently with limited coordination and are updated opportunistically, using the latest validated release from the previous phase.

The build/test phase always starts from the latest commit in trunk. Multiple build jobs run in parallel, allowing compilation issues to be detected and fixed quickly.

Once a new build/test cycle is complete, it will deploy to the first staging tier, c1.

From c1, employees have access to their changes, and we can spot any major issues. Within a few minutes, the c2 phase will start updating using the last successful c1 release. The c2 region has a small percentage of production traffic. This means changes can go live for some users in under two hours.

After the c2 update is complete and c3 is still running the previous validated release, the deployment pauses to collect signals. 

  • Any large deviations between c2 and c3 traffic will pause the deployment and notify metric owners to help triage.
  • When there are no detected regressions, the release is promoted to c3.

In total, the full cycle from c1 to c3 can be completed within three hours.
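Put together, the control flow resembles the following simplified sketch. The phase names match those above, while the health comparison, deploy steps, and release identifier are placeholders for illustration.

```python
def metrics_healthy(candidate: str, baseline: str) -> bool:
    """Placeholder for comparing candidate-phase traffic against the baseline phase."""
    return True

def deploy(release: str, phase: str) -> None:
    print(f"deploying {release} to {phase}")

def promote(release: str) -> bool:
    deploy(release, "c1")  # employee traffic: earliest signal
    deploy(release, "c2")  # small slice of production traffic
    # c3 still runs the previous validated release, so c2 vs. c3 is a live comparison.
    if not metrics_healthy(candidate="c2", baseline="c3"):
        print("deviation detected: pausing and notifying metric owners")
        return False
    deploy(release, "c3")  # full production
    return True

promote("www-2024.06.01.001")  # hypothetical release identifier
```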

The simplicity is a feature! Developers can estimate when their changes will go live, and each release contains a relatively small number of changes, which helps us quickly identify breakages.

And remember: except for c2 and c3, each phase runs independently.

  • Newer builds keep progressing while c1 is updating, and…
  • c1 can update multiple times while c2 and c3 are being deployed.

Sustainable WWW release process

To meet changing scale and business needs, we have adjusted some of our technical and operational approaches over the years while keeping our core principles and velocity intact.

Without these adjustments, we would have faced delays in releases, more frequent and prolonged production issues affecting reliability, and an unsustainable workload for the team responsible for managing the release process. 

This would not have been possible without a focus on the following three key areas:

Figure 2. Key areas of release sustainability

Company-wide ownership culture

Fostering a company-wide ownership culture is essential for maintaining release speed and reliability. 

We have established a global volunteer on-call team, internally called TreeHuggers, who manage all major release operations around the clock with three primary rotations across the globe. These engineers are deeply committed to understanding the WWW codebase and release process, working to reduce release exceptions through a combination of automation and hands-on efforts.

Company-wide collaboration and shared on-call support created a virtuous cycle, enabling the core release team to focus on improving infrastructure and processes, such as release speed, automation, tooling, and safety mechanisms. This approach helps reduce the need for manual intervention in case of release issues.

Over the years, we have increasingly relied on engineers who are not release-engineering specialists to sustain our continuous WWW release speed and reliability. Following this principle, our release tools are designed with usability and safety in mind for non-specialist users. For example, our central release-management tool streamlines WWW release management by abstracting complex deployment pipelines. It provides a clear overview of the repository state, ongoing deployments, and recent build outcomes, and key actions such as stopping a release, blocking specific revisions, or starting a new build are easily accessible.

Shared company-wide release responsibility helps foster best practices and a strong release culture, ensuring that everyone is aligned in the mission to ship code safely and reliably at scale.  

Product developers support their changes and own their product quality, and over the last few years we have given them more control. In addition to the checks performed during continuous integration, developers have alerts in place to detect issues during the release.

The ownership culture is wider than just the alert-infrastructure level. At Meta, employees are always testing new changes in c1, where we can get early signals and prevent bad changes from being released.

Release tooling and automation

The next key area has been a consistent focus on building better tools that increase our overall automation. Since 2017 the primary release path has been automated, but for everything else, release operators were essential for finding and clearing errors to keep the release moving.

Seven years later, we have several times the number of changes and significantly more test runs during release. Despite this, the share of fully automated releases has increased to over 95 percent. This was only possible with automation that solves many of the common failures.

For example, automation corrects syntax errors introduced by “land” races. Next, test failures are auto-categorized to determine if they were due to a recent commit or an external issue. And when we confirm an error, automation will revert the bad commit and any additional changes related to the same feature, which we call “stacks.” This is generally safer and reduces the chance of merge conflicts.
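As an illustration of the last point, the revert step might look something like the sketch below. The stack lookup and revert helpers are hypothetical stand-ins for source-control and CI integrations, not Meta’s actual tooling.

```python
# Illustrative sketch of stack-aware auto-revert; the data and helpers are
# hypothetical stand-ins for source-control and CI integrations.

# Hypothetical mapping from a commit to the stack of related commits it landed with.
STACKS = {
    "abc123": ["abc111", "abc122", "abc123"],
}

def commits_in_stack(commit: str) -> list[str]:
    """Return the commit plus any related commits landed as part of the same stack."""
    return STACKS.get(commit, [commit])

def revert(commits: list[str]) -> None:
    """Placeholder for creating and landing a revert of the given commits."""
    print(f"reverting {commits}")

def handle_confirmed_failure(bad_commit: str) -> None:
    # Reverting the whole stack, rather than a single commit in the middle of it,
    # is generally safer and reduces the chance of merge conflicts.
    revert(commits_in_stack(bad_commit))

handle_confirmed_failure("abc123")
```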

One crucial aspect of our tooling and automation improvements is deciding where to invest our time and effort. There is never enough time to tackle every problem, so it is essential to identify which issues have the most significant impact, and by how much.

To address this, we’ve developed metrics that reflect the release team’s priorities. These metrics help us pinpoint regressions and opportunities for enhancement. Over time, as we improved our processes, the metrics we focused on evolved. 

Initially, we tracked the release rate; later we shifted to categorizing the number of fully automated releases to help enhance automation. When we identified a small number of “heavy” releases, we introduced an Operational Load metric to assess overall efficiency. These changes required accumulating more data, which is why the evolution took time. To improve how quickly release issues are detected and resolved, we track the duration and severity of unplanned release stoppages with Recovery Time, alongside the overall Change (Diff) Release Time metric.
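As a rough illustration, simplified versions of two of these metrics could be computed as follows. The definitions here are assumptions made for the sake of example and do not reflect the exact internal formulas.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Stoppage:
    start: datetime  # when the release was paused unexpectedly
    end: datetime    # when the release resumed

def recovery_time(stoppages: list[Stoppage]) -> timedelta:
    """Total time the release pipeline spent stopped in a given window."""
    return sum((s.end - s.start for s in stoppages), timedelta())

def automated_release_rate(total_releases: int, manual_interventions: int) -> float:
    """Fraction of releases that completed without human action."""
    return (total_releases - manual_interventions) / total_releases

day = datetime(2024, 6, 1)
print(recovery_time([Stoppage(day, day + timedelta(minutes=45))]))        # 0:45:00
print(automated_release_rate(total_releases=10, manual_interventions=0))  # 1.0
```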

Having and refining this data is crucial because it helps us frame new features in the context of these metrics, keeping projects focused and alerting us to any unexpected negative effects.

Independent monolith deployments

From Facebook’s very beginning, the WWW release was run as a single pipeline. But now, WWW is more than just facebook.com, and we’re rethinking how to give more control to product teams that want to be directly responsible for their deployments. Internally we call this “multitenancy,” but it essentially provides a distinct release pipeline for select products.

A tenant pipeline normally mirrors the main release but adds extra capabilities (see the configuration sketch after this list), such as

  • Custom product-specific signals
  • Manual release controls for skipping specific releases
  • The ability to roll back their own pipeline
  • The ability to release ahead of the core WWW pipeline
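As an illustration, a tenant could be described declaratively along these lines. The schema and field names are hypothetical, not Meta’s internal configuration format.

```python
# Hypothetical declarative description of a tenant pipeline; field names are
# illustrative, not Meta's internal schema.
tenant = {
    "name": "example_product",
    "mirrors": "www-main",               # follows the main WWW release by default
    "extra_signals": [                   # product-specific health checks
        "example_product_error_rate",
        "example_product_latency_p99",
    ],
    "manual_controls": {
        "can_skip_release": True,        # skip a specific release
        "can_rollback": True,            # roll back the tenant pipeline independently
    },
    "release_ahead_of_main": True,       # may release before the core WWW pipeline
}
```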

As a result, product owners have more control over when they release. In return, these teams perform some of the daily operations, acting like a specialized TreeHugger, and they also invest in improving the quality of their health signals.

Figure 3. Product-release transition with WWW multi-tenancy

Takeaways

Releasing from a monorepo at Meta’s scale is a challenging problem because of the codebase size, the number of changes, the complexity, and the number of developers working on the same code simultaneously.

However, monorepos remain our best option for large codebases like WWW, with many benefits including consistent versioning and development experience, a well-integrated code ecosystem, and centralized build and test systems.

Core principles are essential: by following the WWW continuous-release approach, we have overcome the challenges of growth, kept our release velocity, and achieved some great wins for the company. Our principles are:

  • Shifting signals to early stages in the development process
  • Releasing quickly and iteratively, aiming for good rather than perfect results
  • Maintaining efficient gradual exposure
  • Keeping a high bar for reverting releases in production

Sticking to established core principles, even when faced with challenges, provides a framework for making decisions, helps to stay focused on long-term goals, and maintains consistency. Despite the challenges, frequent releases are a competitive advantage, increasing developer productivity and motivation, and distributing changes across smaller, more reliable deployments.

Over the years, we’ve adjusted our technical and operational strategies to accommodate our growing scale. A strong focus on fostering a company-wide ownership culture, leveraging automation, and implementing proactive mitigation without human intervention has been key to sustaining our release processes. Company-wide collaboration is essential for maintaining monolithic releases reliably.

Finally, further evolution is underway with federated control of monolith deployments: multitenancy has allowed us to move from a single monolithic deployment to a more flexible delivery system that addresses operational scalability challenges.

What truly made our journey possible is a strong company-wide involvement of many infrastructure and product teams. Our release experience might differ from yours, but we hope the main takeaways will be applicable and helpful. 
