At Facebook, virtually all our infrastructure is powered in some fashion by Apache ZooKeeper. This includes service discovery, configuration management, package deployment, and cluster management. Every piece of our infrastructure must maintain commitments of consistency and durability in the face of machine failures, network partitions, and human error. More often than not, ZooKeeper is the low-dependency metadata storage service of choice.
The infrastructure that ZooKeeper powers comprises a ubiquitous cluster management platform, atop which the rest of Facebook’s software runs. This platform autonomously manages thousands of services across millions of machines, providing a huge degree of flexibility and leverage for engineers.
So when Facebook’s ZooKeeper team decided that they, too, wanted this flexibility and leverage, it meant turning our dependency graph on its head. In this talk, we will present the 18-month journey that brought hundreds of ZooKeeper ensembles in from the cold bare metal so they could safely run atop the cluster management platform that they make possible.
Speaker
Christopher Bunn, Meta