Vacuum Testing for Resiliency: Verifying Disaster Recovery in Complex | Jie Huang & Christopher Bunn
Engineers at Meta run thousands of services across millions of machines, and those services all have similar needs that can’t be managed by hand: configuration, deployment, monitoring, routing, orchestration, security. To manage the ever-growing complexity of running in production, we build a lot of infrastructure. But infrastructure is just software, and it has those same needs, too. So what powers those systems?

We like to imagine a neat, tidy stack of yet more infrastructure, precisely layered, where everything is organized into a faultless, acyclic graph of dependencies. Turtles all the way down. But in reality, the lower in the stack you go, the more you find that infrastructure is, and needs to be, deeply intertwined. After all, infra systems need to run on millions of servers, just like the rest of production. This presents a scary prospect: if something breaks, how do we know we’ll even be able to turn things back on again?

In this talk, we’ll present BellJar, a new technique we’ve developed to exercise critical infrastructure in environments where nearly all of Meta’s supporting services don’t work. By providing precisely tailored broken environments, each of which is vacuum-sealed away from the infra we take for granted, we can learn exactly what it takes to bring each system back online during a widespread outage. And by incorporating this tooling into our delivery pipeline, we can incrementally bake resiliency back into our ever-growing infrastructure layer cake, even as it continues to evolve.