Root Cause Analysis (RCA), a common practice throughout the software industry, does not provide any value in preventing future incidents in complex software systems. Instead, it reinforces hierarchical structures that confer blame, and this inhibits learning, creativity, and psychological safety. In short, RCA is an inhumane practice.
Fortunately, there are healthy alternatives to RCA. I’ll touch on those in the conclusion, but first let’s clarify what RCA is, what it is trying to achieve, and why it never works.
A common misconception that encourages the embrace of RCA is that without understanding the ‘root cause,’ you can’t fix what is wrong. Let’s consider the typical RCA process. It is triggered by an incident. As the incident is being resolved, data is collected about how the system is behaving and how people are responding to try to fix it. The system is eventually ‘fixed’ and the incident is over. Usually about a week later a facilitator interprets the data gathered, pulls stakeholders and responders into a meeting, and shares the narrative of the root cause, how it was fixed, and what measures should be taken to defend against that cause in the future.
The facilitated interpretation is the RCA. Notice that the incident was ‘fixed’ prior to the start of the RCA. To home in on some clarifying language, let’s differentiate between the ‘root cause’ and Least Effort to Remediate (LER). As long as an incident is ongoing, LER is absolutely the right thing to pursue. When the building is on fire, put the fire out as quickly as possible. Unfortunately, it is all too common to assume that whatever undoes the damage must also point to what caused it.
Consider an example taken from a large-scale outage that was not publicized but resembles many recent high-profile outages. A configuration change was deployed that brought down a service. The outage was big enough that detection was almost immediate. Remediation took a bit longer, on the order of 40 minutes. Service was restored by reverting a single line of the configuration file.
The start of the incident corresponds to the publication of the one line of configuration. Reverting it was the LER. It’s tempting to say that the one line is the ‘root cause’ of the incident. That’s a temptation, but it won’t get us anywhere useful, and it certainly isn’t objectively true.
In fact, the LER says nothing about a ‘root cause.’ Consider these possible alternative causes:
- The configuration wasn’t wrong; rather, the code that interpreted the configuration file wasn’t flexible enough.
- The person who wrote the offending line wasn’t given enough context by their team, training, and onboarding process about the configuration to know how it would affect the system.
- The system allowed a configuration change to be deployed globally all at once.
- The parts of the organization closer to operations did not spend enough time with feature developers to understand how the latter might actually try to use the configuration file.
- The person’s management chain did not set the appropriate priorities to allow for more testing, exploration, and review before the change was pushed.
- The peers who signed off on the change were so busy because of external resource constraints that they didn’t have time to properly review the pull request.
- The company doesn’t make enough money to absorb the higher deployment costs of blue/green deployments, whereby new configurations are run in parallel with the old until they are verified to be correct.
We could go on, but the important thing to notice here is that the LER is misleading in a post-incident review process. None of these alternatives are fixed as easily as reverting the one line. None of them suggest that they stem from the same ‘root.’ None of these pin blame on the same person. And yet, all of these alternatives are more fundamental to resilience than the one line of a configuration file.
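To make one of these alternatives concrete, take the bullet about deploying a configuration change globally all at once. The sketch below is a minimal illustration, not the system from the outage: the stage names and percentages, the `apply` and `health_check` callbacks, and the `staged_rollout` function are all hypothetical. It applies a change to progressively larger slices of a fleet and stops at the first stage that looks unhealthy.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical rollout stages: (name, percent of the fleet covered so far).
STAGES: Sequence[Tuple[str, int]] = (("canary", 1), ("regional", 25), ("global", 100))


def staged_rollout(
    hosts: List[str],
    new_config: Dict[str, str],
    apply: Callable[[str, Dict[str, str]], None],
    health_check: Callable[[str], bool],
) -> Tuple[bool, List[str]]:
    """Apply new_config stage by stage; stop at the first unhealthy stage.

    Returns (completed, hosts_that_received_the_change).
    """
    updated: List[str] = []
    for _name, percent in STAGES:
        # Cover at least one host per stage, and never more than the fleet.
        target = min(len(hosts), max(1, len(hosts) * percent // 100))
        for host in hosts[len(updated):target]:
            apply(host, new_config)
            updated.append(host)
        if not all(health_check(h) for h in updated):
            return False, updated  # the caller decides how to revert
    return True, updated
```

Even this small sketch hides decisions that no single line of configuration determines: who defines "healthy," how long each stage should run before the next begins, and who pays for the slower deploys.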
Whither the ‘Root Cause’?
Even with the language to differentiate between ‘root cause’ and the LER, it’s still tempting to think that whatever the ‘root cause’ is, it needs to be identified so that we can take steps to prevent it from happening again. There are two fundamental issues with this. One issue is that the RCA process doesn’t encourage positive outcomes. The other issue, and the one we want to tackle first, is that a ‘root cause’ doesn’t actually exist.
At this point, a die-hard positivist might bang their head against a wall and insist that obviously there was an initial condition on which subsequent conditions were dependent, leading to the incident. Unfortunately for the positivist, all complex software systems are also sociotechnical systems. They are made up of hardware, code, and the humans who maintain them, who communicate with each other in inscrutable ways about their incomplete mental models. Sociotechnical systems are not linear. They are not predictable. They are not deterministic. And no matter how much data we gather, they are not objective.
As we enumerated in the list above, there are other contributing factors to the incident. We don’t have to consider the whole world. We can narrow down by focusing only on things that, had they been different, would result in a world where the system would not have gone down. This includes: the configuration software, the software that interprets the configuration, all of the other systems that depend on or affect that, the robustness and fallback functionality of all of the other services involved, management’s decisions on how to hire and train, the culture of unspoken work expectations, inter-organizational relationships and politics, all company-wide resource allocations, etc. Even the personal life of the person who wrote that line is relevant since that could have affected their mental state and perception at the time. With enough time and a good enough imagination, we can come up with a near-infinite list of contributing factors.
In fact the person who wrote the offending line of configuration had recently adopted a new cat, who had been keeping them up late the night before, so perhaps the real ‘root cause’ is the cat? That’s absurd of course, but arbitrarily so. It’s no more or less absurd than drawing a line around the configuration change and calling that the root, or the team dynamics, or any of a million other things.
The selection of one thing as a ‘root cause’ versus another isn’t something that exists out there in reality. It’s something that one particular human projects into their narrative of what happened. It’s a story.
So What is RCA Actually Doing?
Most RCA processes would single out the configuration file as the ‘root cause’ of the incident described above, and blame would fall to the author of that file. This follows a sort of ‘whoever touched it last’ convention of blame.
To paraphrase an example that I first heard from John Allspaw, think of a celebrated success in your company. Perhaps your company hit a revenue target, or released a big new featureful version upgrade, or hit a customer success goal. Now ask yourself: what is the one thing that happened, at one specific time, and who is the one person in the company to whom all credit is due for that success?
Of course, that’s a silly question. Everyone is responsible for the success. The whole organization contributed to it in a million different ways over time. Why then do we treat failure as though it is attributable to just one person who did one thing at one specific time? If everyone is responsible for the successes, then why isn’t everyone also responsible for the failures?
The answer is that RCA isn’t actually about finding a ‘root cause,’ even if such a thing existed. RCA exists to establish blame. Specifically, it finds blame in the lowest level of the hierarchy, which preserves a perverse sense of order for people higher up the chain.
The higher you go in management, the more tempting it is to rely on RCA, because it offers an easy way to avoid asking difficult questions and allocating resources to develop more meaningful solutions. Pointing down the chain, establishing blame somewhere on the front line, absolves management from having to dig deeper or do any of the harder, more fundamental work to make the system more robust.
As a side effect of this ineffectual convenience for the hierarchy, an individual is shamed and often punished with additional work to “prevent it from ever happening again.” Because this shame and punishment are sanctioned in official policy, they undermine the psychological safety of people on the team, particularly the freedom to make mistakes without retribution.
When people come to work with every intention of doing what is best for their company, which most people do, and get shamed and punished despite that intention, it disrespects the humanity of everyone in the organization. It demeans the person singled out and crushes their dignity. That’s why we call RCA an inhumane practice.
So RCA doesn’t achieve what it sets out to accomplish. Instead it undermines the psychological safety of individuals, which adversely affects organizational performance, and deprives the organization of meaningful opportunities to learn from undesirable outcomes. What’s the alternative?
Post-incident reviews offer substantial learning potential. Just because RCA is a counter-productive method doesn’t mean we should abandon the opportunity to learn from incidents altogether. Blameless Post-Mortems are a step in the right direction, but they don’t go far enough. Typically these still try to establish a ‘root cause’ and try to preserve psychological safety by not punishing the people who are at fault. Faultless Learning Reviews are a better step in the right direction, since they reinforce the fact that you can’t fault a person for an outcome that they did not intend. The system just has output: some of it we like, some of it we really don’t like, and any one person’s proximity to the latter is largely incidental.
Fortunately we have about forty years of relevant research to draw from in the field of Human Factors and System Safety, which was kicked off in large part by studies of the incident at Three Mile Island. One branch of this research in particular, Resilience Engineering, has made its way into the software industry. Companies like Adaptive Capacity Labs are pioneering the application of Resilience Engineering to large complex software systems running at scale.
At the very highest level, we can say that a better approach to post-incident learning involves interviewing multiple people. These people could be involved in remediation of the incident, they could be maintaining the parts of the system that form the context prior to the incident, they could be domain experts in the services involved, etc. The facilitator’s job is then to take those narratives and find communication gaps: information that some people have and that could also be meaningful to others as they try to do their jobs and keep the system working. Improving that communication and transfer of knowledge then becomes an enabler of success, which can only benefit the knowledge work that characterizes our field.
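One small part of the facilitator's job can be sketched in code. What follows is a toy illustration, assuming the interview notes have already been reduced to short statements per interviewee. The interviewee roles, the statements, and the `communication_gaps` function are all hypothetical, and real interviews are far richer than set membership. The sketch lists, for each piece of knowledge, who has it and who does not.

```python
from typing import Dict, List, Set, Tuple


def communication_gaps(
    narratives: Dict[str, Set[str]],
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Map each statement held by only some people to (who knows it, who does not)."""
    everyone = set(narratives)
    all_statements: Set[str] = set().union(*narratives.values())
    gaps: Dict[str, Tuple[List[str], List[str]]] = {}
    for statement in sorted(all_statements):
        knows = {person for person, facts in narratives.items() if statement in facts}
        missing = everyone - knows
        if missing:  # a statement everyone already shares is not a gap
            gaps[statement] = (sorted(knows), sorted(missing))
    return gaps


# Hypothetical interview notes, reduced to short statements.
narratives = {
    "responder": {"revert restored service", "config is pushed globally"},
    "feature developer": {"parser rejects unknown keys"},
    "operator": {"config is pushed globally", "parser rejects unknown keys"},
}
for statement, (knows, missing) in communication_gaps(narratives).items():
    print(f"{statement!r}: known to {knows}, not yet shared with {missing}")
```

Notice that nothing in the output names a cause or a culprit. It only shows where knowledge could travel further, which is the point of the exercise.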
It’s impossible to prevent incidents entirely, but over time these painful experiences can be used to an organization’s advantage. You can learn from them. The more you know about your own system, the easier it is to optimize for properties that you care about.
Complex software systems will not lose their complexity. RCA will only hinder our ability to navigate them with fairness and grace. Resilience Engineering may hold the key to taking back our dignity as knowledge workers, overcoming naive notions like ‘root cause,’ and learning to navigate our complex systems better.