The Facebook cloud supports a variety of workloads, including those that are CPU-intensive, memory-bound, I/O-bound, latency-sensitive, or a combination of these, on hardware that ranges from smaller single-socket servers to load balancers and switches to multi-socket hosts with accelerators.
This talk is about how we control the resource usage and contention of infrastructure software that runs across this cloud environment, including OS services and proprietary services required for the successful operation of primary applications. Such software operates under tight resource constraints so that it can support our wide variety of colocated workloads, and we frequently encounter trade-offs among new functionality, efficiency, and reliability. Predictable resource usage for host infrastructure software at scale is one of the core abstractions in the Facebook cloud infrastructure, and it gives our services the ability to run at high utilization.
We will go into the details of our resource accounting and measurement infrastructure, the processes we established to keep levels of resource consumption consistent and predictable, and the technologies that allow core Facebook services to utilize the remaining capacity with minimal contention and high reliability. We will also highlight how the lessons we learned and our best practices could be applied in similar private cloud environments.