Why is a Cultural Anthropologist on a Reliability Engineering Team?

Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems at massive scale. Whether it’s novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges.

Casey Bouskill

TOPIC: Data, Systems and Networking

@SCALE SERIES: Reliability @Scale

TYPE: ARTICLE

YEAR: 2023

For many of its formative years, Facebook subscribed to the motto “Move fast and break things.” I did not work for Facebook at that time, but I reaped the benefits of the many products and features born out of that era of speed and innovation. In fact, I was on The Facebook as a college freshman studying anthropology in 2004. My friends and I loved the novelty of it as we dotted our “walls” with inside jokes or tried to find out who in Calc III knew how to prove the Lagrange theorem.

Throughout my studies, Facebook became more and more sophisticated, and my network of friends became much larger and increasingly international. I became more intentional about what I shared and more reliant on Facebook and WhatsApp to communicate as I traveled around the world doing research. Facebook certainly knew that it had been transformed from a novelty into a powerful set of products that the world was increasingly adopting as its platform for self-expression, connection, and everyday communication.

Around 2014, as I was wrapping up my Ph.D. in anthropology, I thought about how the digital world enacted through platforms like Facebook was fundamentally changing cultures all around the world. Facebook knew that, too, and as a rapidly expanding company, “Move fast and break things” of the early days gave way to “Move fast with stable infrastructure”—definitely less catchy, but necessary given its scale and the foothold it had across the world.

In 2019, and in the wake of a few public and consequential outages, Facebook knew it needed to put more muscle behind its new motto. Leadership formed the Reliability Engineering team to help shape a strategy of building and maintaining that stable, reliable infrastructure. These unsung heroes created a solid foundation through direct outreach to organizations to stress the importance of reliability and to be the leaders who either prevented outages from occurring or swiftly fixed them when they did. But with respect to changing our culture to fully espouse reliability and make it a core facet of our work, Reliability Engineering leaders suspected that there was room to grow.

Around the same time, I was employed as an anthropologist in the policy world, going between fieldwork sites in rural Uganda and the streets of Beirut and a dozen places in between, all in service of understanding how culture impacts policies and practices. My job was (and is) to understand culture change, and to relay local cultural perspectives, needs, values, and challenges back to decision-makers so that policies and practices make sense to the people who have to live with them. It is a privilege to have heard people’s stories around the world and to do my best to amplify their voices so that they, too, have a seat at the table.

The Reliability Engineering team recognized that they needed to tap into Meta’s internal culture to fully understand and appreciate how we can shift our culture to make reliability engineering a fundamental aspect of our work—not an afterthought, but something deeply tied to how we build our systems. Knowing that culture change is at the root of what I do as an anthropologist within industry, Reliability Engineering enlisted me to join this growing movement within Meta to build a strong culture of reliability.

So, these days I am mostly working from home, but not at the expense of an interesting fieldwork site. In fact, I am arguably in one of my most fascinating sites yet—Meta’s Infrastructure. Meta employs a handful of anthropologists, and while my counterparts perform work around the world to understand, for example, the key barriers people with disabilities in the UK face in using virtual reality and how women in India leverage Facebook Communities to bond with one another, I have a unique opportunity to exclusively study Meta’s internal culture and how it intersects with new policies and practices introduced as we continue to scale and mature. Studying the culture where I am employed means I occupy a peculiar but fun role where I am fully both an insider and outsider, meaning I am totally a part of our internal culture, but it’s also my job to step back and study that culture as objectively as I can.

While my engineering colleagues proactively and reactively manage incidents 24 hours a day, I play a supporting role as their resident researcher striving to make reliability a first-class value. But you don’t have to be an anthropologist to contribute to culture change at your organization, so I want to share how we have undertaken this goal and provide insights into how you can adapt this research approach to your context.

Given that anthropology is the study of humans, I prioritize taking a human-centered approach to understanding problems and their solutions. Anthropology offers a unique way of getting at a human-centered approach in that we use what is referred to as the etic and the emic perspective (readers interested in linguistics may recognize the words phonetic and phonemic—the same principles apply). The etic perspective is the outsider’s view on a phenomenon and the frameworks outsiders lay onto phenomena of interest, while the emic represents the insider cultural knowledge, values, and ways of interpreting and experiencing a phenomenon. If the two perspectives are not in dialogue with one another, it can at best introduce undue strain on people and at worst cause even the most well-meaning initiatives to fall short.

When it comes to reliability engineering, the etic perspective might assume certain frameworks, structures, and standard operating procedures that dictate when something is reliable, how the work to ensure reliability should be done, and how the work should be balanced against competing priorities. The emic perspective breaks any assumptions about reliability by focusing on how on-the-ground engineers describe what it means to them to engage in reliability work, their perceived tradeoffs in performing that work, and the things that block or facilitate them along the way. The emic tells you from the local perspective why things are or aren’t working on the ground; if you are lucky, the etic frameworks will reflect whether or not things are working. Ensuring that the two perspectives are in sync with one another helps to align expectations and interpretations, and ultimately to arrive at the desired state.

What do these different perspectives mean in the context of reliability engineering? One would be hard pressed to find any aspect of engineering that isn’t complex, but reliability is in a class unto itself, for two main reasons. First, because of the room for interpretation (for example, “How reliable does something need to be?” “Where does reliability engineering stop and performance engineering begin?” “Who is responsible for reliability in a hyperscale, deeply interdependent system?”). Second, because it is difficult to prove the counterfactual (“Was the absence of an outage the result of proactive reliability efforts or just dumb luck?” “How do we know when the work we’re putting into reliability is sufficient?”).

When we set out in Spring 2022 to research how we can build a stronger culture of reliability, our general inclination was that reliability work was just not as valued, exciting, or rewarding as shipping new features. Efforts such as complexity reduction or performing better SEV reviews can’t compete with, say, launching Threads. There will always be competing demands. But culture can and does change to accept new values. Cigarette smoking is an example: Cultural marketing made it uncool, health messaging revealed its harms, strong policies limited where people could smoke, and high taxes often made it prohibitively expensive. The trick is to understand where a new value can tap into the existing culture and stick. In our case, we didn’t need to make reliability “cool,” but we did want to make it an intrinsic part of our development process, as opposed to an afterthought or, even worse, something we did only when it was absolutely critical.

I started this research project by defining the etic perspective. Meta directors and senior reliability experts generously spent hours teaching me the gold standards of reliability work. I pored over published literature to read about reliability engineering, both in practice and theory, and watched tutorials from the titans of tech companies. Taken together, this work helped me identify industry standards as well as Meta’s expectations of what reliability should look like in practice from the reliability experts’ view.

Then I turned my focus to the emic (the local perspective), which meant finding people who are not directly involved in reliability work. Identifying a sample of people who are not skewed towards one particular perspective requires careful consideration. I knew my immediate Reliability Engineering team had software engineers, production engineers, technical program managers, program managers, data scientists, and others whom they wanted to see included in the sample, but even if that group of people offered a range of perspectives on reliability, their viewpoints would be inherently biased by their connections with Meta’s reliability experts. Perhaps they had mitigated a big outage or some other close call, or perhaps they were just known for being unsung heroes of reliability wins, but in any event, I needed to broaden my sample to reach people beyond those put forth by the immediate team.

Fortunately, at Meta it’s normal to change teams, and a year and a half of tenure had nearly made me a veteran in the tech world—so I had met cross-functional partners across our Infrastructure whom I could contact as a starting point. For example, I asked the Technical Program Manager in Privacy if she knew anyone in WhatsApp, and the software engineering manager in Central Integrity if he knew anyone in Reality Labs, and so on. I checked my list to ensure I had representation from around the world, varying levels of tenure, managers versus Individual Contributor status, and domain across the company. When I asked people if they would be willing to participate in an interview with me on reliability at Meta, their reply of “Wait, me? Are you sure you have the right person?” meant I was on the right track.

A common question I hear with qualitative-research approaches is “how many interviews is enough?” Fortunately, that is something my discipline has empirically studied. First of all, enough to do what? If you are testing a hypothesis, you need to be statistically powered through an adequate sample size to detect an effect. In exploratory research, you are concerned not with testing a hypothesis, but rather with building hypotheses, some of which are contingent on particular circumstances. Given that, performing qualitative research is less about a definitive number of interviews to be “enough” and instead about ensuring that you have the right representation of respondents to weigh in on your questions and that you start to hear repetition of themes (i.e., people start to say similar things). In the published literature, it takes about nine to 12 interviews in a relatively homogeneous sample to get thematic repetition, and roughly another 10 before researchers are able to fully grasp how those themes fit together. I hit 25 and was learning so much that I kept going (beyond 40 interviews) just to be sure I had covered all of my bases and could confidently connect the dots on how people discussed what it would take to build a stronger culture of reliability.
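
One way to make “repetition of themes” concrete is to count how many previously unseen themes each successive interview contributes and to watch for a run of interviews that adds nothing new. The sketch below is a minimal illustration under assumptions, not the procedure used in this study: the function names, the three-interview window, and the theme sets are all hypothetical, and a real decision to stop interviewing would also weigh whether the sample is representative.

```python
from typing import Iterable, List, Optional, Set


def new_themes_per_interview(interviews: Iterable[Set[str]]) -> List[int]:
    """Count how many previously unseen themes each interview contributes."""
    seen: Set[str] = set()
    counts: List[int] = []
    for themes in interviews:
        fresh = set(themes) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def saturation_point(interviews: List[Set[str]], window: int = 3) -> Optional[int]:
    """Return the 1-based number of the interview that completes the first run
    of `window` consecutive interviews with no new themes, or None if no such
    run exists yet."""
    run = 0
    for number, count in enumerate(new_themes_per_interview(interviews), start=1):
        run = run + 1 if count == 0 else 0
        if run >= window:
            return number
    return None


# Hypothetical theme sets, one per interview, in the order conducted.
interviews = [
    {"recognition", "on-call toil"},
    {"recognition", "metrics"},
    {"dependencies"},
    {"metrics", "on-call toil"},
    {"recognition"},
    {"dependencies", "metrics"},
]
print(new_themes_per_interview(interviews))  # [2, 1, 1, 0, 0, 0]
print(saturation_point(interviews))          # 6
```

A count like this only signals that themes have stopped appearing; it says nothing about how the themes fit together, which is why the literature cited above suggests continuing for roughly another 10 interviews.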

Another warranted question I hear is, “How do you make sense of the thousands of pages of interview transcripts from your data collection?” What I implicitly hear is, “How do we know the research team is not cherry-picking quotes to support their underlying beliefs?” This is a fair concern, but one easily refuted by taking a rigorous, empirical approach to qualitative data analysis. We did this by following standard procedures for thematic analysis, which involved co-creating a codebook that captured the types of questions that were asked in the interviews—how someone defines reliability, support needed to perform reliability work, team-based reliability efforts, how someone describes the meaning of the work they do—and then importing all of the codes and de-identified transcripts into a qualitative data analysis program. From there, we collaborated on what chunks of text from an interview should be included under a code. I routinely lead qualitative data analyses and have documented this process in several published articles, but in short, it’s about ensuring that all of the conclusions drawn came from transparent, reproducible analyses (i.e., anyone can trace the steps from the conclusions back to the actual interview data).
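
The traceability idea can be sketched as a small data structure in which every coded chunk of text carries the transcript and line range it came from, so any conclusion can be followed back to its evidence. This is an illustrative sketch only: dedicated qualitative data analysis programs handle this for you, and the class names, code labels, participant labels, and excerpt text here are hypothetical placeholders rather than anything from the actual study.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Excerpt:
    """A chunk of transcript text, addressed so it can be looked up again."""
    transcript_id: str  # de-identified label such as "P07", never a name
    start_line: int
    end_line: int
    text: str


class CodedCorpus:
    """Maps each codebook entry to the excerpts filed under it."""

    def __init__(self, codebook: Dict[str, str]) -> None:
        self.codebook = dict(codebook)
        self._by_code: Dict[str, List[Excerpt]] = defaultdict(list)

    def apply(self, code: str, excerpt: Excerpt) -> None:
        """File an excerpt under a code; reject codes missing from the codebook."""
        if code not in self.codebook:
            raise KeyError(f"unknown code: {code}")
        self._by_code[code].append(excerpt)

    def evidence(self, code: str) -> List[Excerpt]:
        """Return every excerpt behind a code, for tracing a conclusion back."""
        return list(self._by_code.get(code, []))

    def coverage(self, code: str) -> int:
        """Count the distinct transcripts in which a code appears."""
        return len({e.transcript_id for e in self._by_code.get(code, [])})


# Code names paraphrase the interview topics listed above; the definitions,
# participant labels, and excerpt text are invented placeholders.
codebook = {
    "defines_reliability": "How the respondent defines reliability",
    "support_needed": "Support needed to perform reliability work",
    "team_efforts": "Team-based reliability efforts",
    "meaning_of_work": "How the respondent describes the meaning of their work",
}
corpus = CodedCorpus(codebook)
corpus.apply("support_needed", Excerpt("P01", 40, 46, "placeholder excerpt A"))
corpus.apply("support_needed", Excerpt("P01", 112, 115, "placeholder excerpt B"))
corpus.apply("support_needed", Excerpt("P02", 12, 19, "placeholder excerpt C"))

for e in corpus.evidence("support_needed"):
    print(f"{e.transcript_id}:{e.start_line}-{e.end_line}  {e.text}")
print(corpus.coverage("support_needed"))  # 2
```

Counting distinct transcripts rather than raw excerpts is a deliberate choice in this sketch: one talkative respondent should not make a theme look more widespread than it is.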

The process of interviewing 40-plus people followed by an in-depth thematic analysis is labor-intensive. The advantage of this approach is that it yields rich, contextual data and helps you understand why certain phenomena are occurring and why people feel that way about them. Interviews help point you to what is most salient about a particular issue. The interviewer poses a few questions, but it’s ultimately their job to let the interviewee steer the conversation and reveal what really matters. The disadvantages are that this approach cannot be easily replicated, and that contextual depth comes at the expense of breadth.

That’s where surveys are useful, provided that they incorporate the salient constructs that emerged from the interviews. Constructs are the ways to describe a theoretical concept or idea: the abstract stuff like “sentiment” or “risk” or even “reliability.” We used the major themes from the interviews to zero in on a few select questions for a reliability sentiment survey, which in broad strokes tracks the perceived value of reliability and the barriers and facilitators to doing this work. We have collected three waves of survey data, which are helping us track changes over time and identify any correlations between the strategies employed by Reliability Engineering and survey outcomes.

One key finding described in interviews and corroborated in surveys was that while nearly everyone agreed that reliability work is important and thought we as a company should do more of it, many people called out a few key barriers. One of the most important barriers mentioned was the perception that reliability would not be as recognized or rewarded in our performance reviews as other types of work. They mentioned that reactive reliability work would need to reach a tipping point before managers would give the green light to devote a half-year to reliability, and that proactive work would likely be seen only as an add-on, over and above one’s duties. This concern did not seem to fully align with the broad messaging from leadership on the importance of reliability; somehow, that etic message was not landing with the emic concerns of engineers on the ground.

We decided to look into this perception against the data, and what we found was that reliability recognition in our performance reviews has been increasing over time and is correlated with higher ratings. Hence, there was a slight mismatch between the etic and the emic. We reasoned that more systemic changes would help drive that change in perspective, and our Reliability Engineering leaders used the qualitative and quantitative data we collected to influence internal policy on expectations of developers across levels to include reliability. We are still waiting to evaluate the full impact of that change, but early signals from our last survey show that developers are potentially still cautiously optimistic about the systemic shift, which means we will stay committed to a focus on reliability as a core component of our work. And we will continue to work to ensure that the etic and emic are in sync.

In interviews, engineers elaborated on themes such as dealing with on-call toil, burdensome incident report writing, dependencies, lack of widespread reliability metrics (e.g., SLIs/SLOs for a service) and difficulty linking metrics to impact, and needing more education and training on how to operationalize reliability work. We turned these barriers into survey items so that we can continue to track them over time and draw on them to prioritize our own efforts. For example, we have reduced the burden of incident reports, provided dashboards to help teams track their on-call health and manage their dependencies on other services, and kickstarted an initiative not only to set more robust reliability metrics, but to use them to drive organizational- and team-level improvements in reliability. Like reliability, the work is never complete, but at the very least, systematic tracking of reliability sentiment helps our company zero in on the most impactful areas at particular times and understand where our efforts are driving down barriers and where we need to continue our efforts.

At its core, my role is to listen to people and to help connect the dots, and as much as I would be delighted to see more anthropologists employed across the industry, the title is insignificant vis-à-vis that core principle of just “listening to people.” In other words, there are likely already people on your team who can employ this approach. A word of caution, though: If there is any sort of explicit or implicit hierarchy between the person asking the questions and the respondent, don’t expect to get the full picture; people tend to be less forthcoming with higher-ups than, say, with a researcher who has no bearing on anyone’s career trajectory and is bound by confidentiality standards. And while certainly not a requirement, it may help if the person doing the research skews a bit more toward being an extrovert, which isn’t perhaps the first thing we think of in the tech industry; by my latest count, the Introverts @Meta interest group had 4,308 members, Extroverts @Meta: 115.

Personality traits aside, the principles are fairly straightforward: Start with in-depth discussions with a range of respondents who represent the phenomena of interest you want to understand. Ensure that your sample does not solely consist of people who will give you the answers you may want to hear. Keep performing interviews until you are confident you are not learning anything substantially new (for example, until themes repeat across interviews). Then decide on the metadata categories that your team wants to apply to the qualitative data and use them to organize your findings. Explore the range of responses under each metadata category and try to understand how those different themes intersect with one another to inform your conclusions.

Where possible, use the findings to draft straightforward survey questions with an exhaustive list of response options. Test the survey by asking a representative sample of potential respondents to take it: time them to check that the survey can be completed in a reasonably short amount of time, and ask them for their interpretation of each question and response option to ensure that your team’s interpretation of the questions matches that of the respondents. Field the survey at regular intervals, and track changes in the quantitative results over time.
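
Tracking a survey item across waves can be as simple as computing the share of respondents who agree in each wave and the change from the previous wave. The sketch below assumes a hypothetical five-point agreement scale (1 = strongly disagree, 5 = strongly agree) and treats 4 or 5 as agreement; the function names, wave labels, and response values are invented for illustration and are not data from the reliability sentiment survey.

```python
from typing import Dict, List, Tuple


def percent_agree(responses: List[int], threshold: int = 4) -> float:
    """Share of responses at or above `threshold`, as a percentage."""
    if not responses:
        raise ValueError("a wave needs at least one response")
    agreeing = sum(1 for r in responses if r >= threshold)
    return 100.0 * agreeing / len(responses)


def wave_changes(waves: Dict[str, List[int]]) -> List[Tuple[str, float, float]]:
    """Return (wave, percent agreeing, change vs. previous wave) in the order
    the waves were fielded. The first wave has no predecessor, so its change
    is reported as 0.0."""
    rows: List[Tuple[str, float, float]] = []
    previous = None
    for name, responses in waves.items():
        current = percent_agree(responses)
        change = 0.0 if previous is None else current - previous
        rows.append((name, current, change))
        previous = current
    return rows


# Invented responses to one hypothetical item, keyed by wave in field order.
waves = {
    "wave_1": [2, 3, 4, 2, 5, 3, 2, 4],
    "wave_2": [3, 4, 4, 2, 5, 3, 4, 4],
    "wave_3": [4, 4, 5, 3, 5, 3, 4, 4],
}
for name, pct, change in wave_changes(waves):
    print(f"{name}: {pct:.1f}% agree ({change:+.1f} points)")
# wave_1: 37.5% agree (+0.0 points)
# wave_2: 62.5% agree (+25.0 points)
# wave_3: 75.0% agree (+12.5 points)
```

A percentage-point change like this is a descriptive signal, not a significance test; with small or shifting respondent pools, pair it with the sample size for each wave before drawing conclusions.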

Use the qualitative and quantitative findings to identify opportunities for improvement, and prioritize your roadmap. Remind decision-makers of the findings regularly, and pair succinct, exemplary quotes from interviews with statistics, as that can help drive concrete actions. Keep an open and curious ear to the ground for any new information and incorporate it into ongoing research efforts. Ultimately, remember that culture is the most versatile tool in the human toolkit. Culture is what makes us human, and it might just be the catalyst to help all of us across the tech industry adapt to an ever-evolving, complex world.