Live Traffic Load Testing: Measuring and Validating Capacity at Facebook
Facebook is made up of hundreds of heterogeneous services running in geographically distributed data center regions. To run reliably, every subsystem and service must be provisioned with sufficient capacity. Beyond understanding and measuring each service's maximum servable load, we also need to validate the entire ecosystem across multiple services to ensure sufficient capacity for Facebook products. Measuring maximum load and verifying capacity at Facebook scale involves several challenges: 1) Workloads constantly evolve as the user base grows and new products are launched. 2) Software constantly changes as each service deploys new versions. 3) The interdependencies across services are inherently complex.
To address these challenges, we developed a load testing framework that leverages live traffic in production systems. It has two parts: 1) service-level load testing, which measures the maximum servable load of individual services; and 2) region-level load testing, which verifies the capacity of a product. We'll share how we scaled service-level load testing to a large number of services despite their diversity, and how we improved data quality despite the noisiness of live traffic. For region-level load testing, we will focus on how we conduct such large-scale tests safely. We will also share findings and lessons from region-level load testing, such as load balancing issues and the scalability of individual services.
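To make the idea of "maximum servable load" concrete, here is a minimal Python sketch of one way a service-level test could work: ramp the load on a target in steps and record the last level that stayed within a health limit. This is only an illustration under stated assumptions, not a description of the framework itself. The function names, the p99 latency criterion, the step sizes, and the toy latency model are all hypothetical.

```python
from typing import Callable


def find_max_servable_load(
    measure_p99_latency_ms: Callable[[float], float],
    latency_limit_ms: float,
    start_rps: float,
    step_rps: float,
    max_rps: float,
) -> float:
    """Ramp load in fixed steps and return the highest healthy load.

    Assumption: "healthy" means the measured p99 latency is at or below
    latency_limit_ms. Returns 0.0 if even the starting load is unhealthy.
    """
    last_healthy = 0.0
    load = start_rps
    while load <= max_rps:
        # Stop ramping as soon as the health limit is breached.
        if measure_p99_latency_ms(load) > latency_limit_ms:
            break
        last_healthy = load
        load += step_rps
    return last_healthy


def toy_latency_model(rps: float) -> float:
    """Hypothetical stand-in for a live measurement.

    Latency rises sharply as load approaches an assumed saturation
    point of 1000 requests per second.
    """
    saturation_rps = 1000.0
    if rps >= saturation_rps:
        return float("inf")
    return 10.0 / (1.0 - rps / saturation_rps)


if __name__ == "__main__":
    max_load = find_max_servable_load(
        toy_latency_model,
        latency_limit_ms=40.0,
        start_rps=100.0,
        step_rps=100.0,
        max_rps=2000.0,
    )
    print(f"maximum servable load: {max_load} rps")
```

In a live-traffic setting the measurement callback would be noisy, so a real test would presumably average over a time window or repeat each step before deciding; the sketch leaves that out for brevity.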
We have used both types of load testing at Facebook for over five years. For service-level load testing, 80,000 tests run across hundreds of services every day, and the machines allocated to those services account for more than one-third of the entire capacity. To maximize the value of load testing, we are actively working to increase its coverage. For service-level load testing, we are working with stateful and storage services to explore the best options for applying load testing. For region-level load testing, we are working with a broader set of products to generalize the approach across all products. With increased load testing coverage, we expect to serve Facebook users with even greater reliability.