Top Seven Myths of Robust Systems
One of my favorite topics of discussion within the domain of Availability is mythology. Not dragons and unicorns, which would be undeniably cool, but myths in the sense of made-up stories we tell ourselves to explain things that we don’t understand. There are many things that we as an industry tell ourselves about the nature of robust systems that are demonstrably false. Without further ado, the Top Seven Myths of Robust Systems:
7. Remove the People Who Cause Accidents
This is often referred to as the Bad Apples model of ‘human error’ in Human Factors literature. Intentional sabotage is exceedingly rare, so here we’re talking about people who are somehow unintentionally involved with incidents. Step 1: Identify the people who are involved in incidents most often, who don’t pay enough attention, who forget to be careful, or who lose situational awareness. Step 2: Remove them from the job. Problem solved, right?
There are many problems with the Bad Apples model. How much does attention cost anyway? What is the metric unit for a careful? Once lost, where does situational awareness go? These colloquial expressions aren’t empirical entities. They don’t have physiological correlates. They are part of the story that we tell with the benefit of hindsight in order to explain bad outcomes.
Almost everyone who shows up to work wants to do a good job. If it is the case that some people are involved in more incidents than others, the problem is almost always not inherent in the person, but rather a systemic property of the work as it is performed in the pursuit of conflicting goals: resource constraints, communication hindrance, poor alignment, inadequate training, more chaotic conditions, etc.
When we look for someone to blame for an incident, we lose an opportunity to learn how to actually make the system better. Don’t blame individuals. That’s the easy way out, but it doesn’t fix the system. Change the system instead. It’s more difficult, but it’s also more fundamental to the robustness of the system.
6. Document Best Practices and Runbooks
Best practices aren’t really a knowable thing. We can speak about current industry norms or default practices, but complex systems are always unique. Norms don’t always apply, and they are least likely to apply when a system is in an unstable state. Besides, systems fail all the time even when they do follow industry norms. Combine this with the fact that most experts are terrible at elucidating how they actually think during remediation, and we have a strong case that thorough documentation of best practices, runbooks, etc. actually provides little value in preventing incidents.
5. Defend against Prior Root Causes
We already dismantled Root Cause Analysis in a prior blog post. The Defense in Depth approach to robustness says that the more defenses you add, the safer your system is. Intuitively this makes sense as a physical metaphor. Unfortunately it turns out catastrophic failures in particular tend to be a unique confluence of contributing factors and circumstances, so protecting yourself from prior outages, while it shouldn’t hurt, also doesn’t help very much.
There is also an interesting class of failures that occur when the defense itself — in software the code written to prevent an incident — ends up being what actually triggers or magnifies the incident. Adding defenses can actually introduce new vectors for undesirable system outcomes.
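As one hypothetical illustration of a defense that magnifies an incident, consider automatic retries. The sketch below is a deliberately simplified toy model, not a description of any real client library: the function name, the fixed per-window `capacity`, and the assumption that every failed request is retried immediately are all ours.

```python
def attempts_with_retries(new_requests: int, capacity: int, max_retries: int) -> int:
    """Count every attempt that reaches a backend during one time window.

    Toy model (assumption): the backend serves at most `capacity` requests
    per window, everything beyond that fails, and each failed request is
    retried immediately, up to `max_retries` times.
    """
    remaining_capacity = capacity
    pending = new_requests
    attempts = 0
    for _ in range(max_retries + 1):  # first try plus the retries
        if pending == 0:
            break
        attempts += pending
        served = min(pending, remaining_capacity)
        remaining_capacity -= served
        pending -= served  # whatever is left failed and will be retried
    return attempts


# Under capacity, the retry defense is invisible.
healthy = attempts_with_retries(new_requests=100, capacity=120, max_retries=3)

# Slightly over capacity, the same defense multiplies the load.
no_defense = attempts_with_retries(new_requests=150, capacity=120, max_retries=0)
with_defense = attempts_with_retries(new_requests=150, capacity=120, max_retries=3)
```

In this toy model a 25% overload (150 requests against a capacity of 120) becomes 240 attempts once three retries are allowed, which is twice what the backend can serve. The code written to protect callers from transient failures is what turns a small overload into a large one.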
4. Enforce Procedures
Few myths are as likely to backfire as this one. If procedures aren’t being followed, it’s often because the people designing the procedures have a theory of how the work can be done in a way that avoids incidents. The people doing the work then try to apply that theory, find that it doesn’t quite line up, and resort to other practices to get the job done. This institutionalizes distrust, creates extra work, adds to bureaucratic friction, and generally makes both sides of the interaction feel that the other side is not smart.
Both sides — the procedure-makers and the procedure-not-followers — have the best of intentions, and yet neither is likely to believe that about the other. After all, why would someone with good intentions purposefully make your life more difficult? The end result is that the additional bureaucracy inhibits the adaptive capacity of the organization and ultimately makes the system less safe.
3. Avoid Risk
You may have heard the mantra, “Make the right thing easy and the wrong thing difficult.” Or more succinctly, “Add guardrails.” These both sound so reasonable, until you realize that they deprive operators of a crucial and necessary ingredient for the development of their skillset:
Systems run by craftspeople are very resilient because they rely on a high level of adaptability, based on the actors’ expertise, linked to an exposure to frequent and considerable risk.
The adaptive capacity to improvise well in the face of a potential system failure comes from frequent exposure to risk.
It turns out that guardrails do two things:
- They prevent people from interacting with the risks in a system in a way that teaches them where the safety boundary actually is.
- They prevent people from improvising in the way that they need to during an unusual or unplanned circumstance such as an active incident or almost-incident.
As an example of the second point, many deployment environments make it difficult or impossible to ssh into an instance. Under ideal circumstances, this makes sense, because ssh access implies a lack of proper automation particularly in a modern immutable infrastructure. The key phrase here is “under ideal circumstances.” During an active incident, ssh-ing into an instance to tail logs and see what’s actually going on is often the best way to figure out the Least Effort to Remediate (LER) and get the system back online.
2. Simplify
Part of the difficulty of dealing with complex software systems is the ‘complex’ part. It makes sense then that if you remove the complexity, you will have an easier time managing the system. For better or worse, removing complexity is not a sustainable option. In fact, more complex systems are demonstrably more robust.
Consider the simplest key-value database that you can imagine. You give it a key and a value, it stores it in memory. You give it a key, it returns the value. Now imagine that we want to make this database robust. We can deploy it in the cloud, persist the keys and values to disk, distribute the data using a consistent hash behind a load balancer, and replicate the data between several regions. In one albeit long sentence, we have very quickly conveyed a complex but tried-and-true method for making the database very robust.
Now go back to that first, simple version. Imagine that we want to make this database more robust, and simpler. It’s impossible. We already started from the simplest we could imagine.
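To make the comparison concrete, here is a minimal Python sketch of both versions. The class names are made up for illustration, and the in-process replicas are only a stand-in for the disk persistence, consistent hashing, and multi-region replication described above. It is a sketch under those assumptions, not a recommended design.

```python
class SimpleKV:
    """The simplest version: keys and values held in memory."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class ReplicatedKV:
    """A more robust, and necessarily more complex, version.

    Writes go to every replica that is up; reads come from the first
    replica that is up. Replicas here are just in-process objects
    (an assumption made to keep the sketch self-contained).
    """

    def __init__(self, replicas=3):
        self._replicas = [SimpleKV() for _ in range(replicas)]
        self._up = [True] * replicas

    def fail(self, index):
        # Simulate losing a node: it goes down and its memory is gone.
        self._up[index] = False
        self._replicas[index] = SimpleKV()

    def put(self, key, value):
        for replica, up in zip(self._replicas, self._up):
            if up:
                replica.put(key, value)

    def get(self, key):
        for replica, up in zip(self._replicas, self._up):
            if up:
                return replica.get(key)
        raise RuntimeError("no replica available")


store = ReplicatedKV(replicas=3)
store.put("user:1", "alice")
store.fail(0)                      # lose one node
surviving_value = store.get("user:1")  # still served by another replica
```

Every line that lets `ReplicatedKV` survive the loss of a node is a line that `SimpleKV` does not have. The robustness was bought with complexity.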
In fact it turns out that complexity and success have a positive correlation with each other. As Jessica Kerr says, “Raise the ceiling for complexity, and you raise the ceiling for success.” If you want to successfully optimize your software for some business-relevant property, that will require adding complexity. So rather than trying to simplify or remove complexity, learn to live with it. Ride complexity like a wave. Navigate the complexity.
1. Add Redundancy
The number one myth we hear out in the field is that if a system is unreliable, we can fix that with redundancy. In some cases, redundant systems do happen to be more robust. In others, this is demonstrably false. It turns out that redundancy is often orthogonal to robustness, and in many cases it is absolutely a contributing factor to catastrophic failure. The problem is, you can’t really tell which of those it is until after an incident definitively proves it’s the latter.
Scott Snook makes the point in his book Friendly Fire: The Accidental Shootdown of U.S. Black Hawks over Northern Iraq, talking about how helicopter pilots push their vehicles faster and further than spec, because they know that the rotors have redundant motors:
Despite designers’ best intentions, redundancy can unwittingly increase the chances of an accident by encouraging operators to push safety limits well beyond where they would have, had such redundancies not been installed. (Snook, 2002, p. 210)
An isolated case? Not even close. One of the most famous examples of engineering failure is the Space Shuttle Challenger explosion in 1986. NASA had been aware of issues with the O-rings sealing the solid rocket boosters since at least the second shuttle mission in 1981, but procedurally decided that it was an acceptable risk. Why? For three reasons, the first of which was: The functionality has redundancy. There was a primary O-ring, and a secondary O-ring. If there had only been one, the engineers never would have signed off on allowing the third mission to proceed, let alone the 25th mission (STS-51-L) that ended so catastrophically.
Of course this happens in software all the time as well. One common failure mode in distributed systems starts when a cache is put in front of a database to handle read requests. This redundancy takes load off the persistent storage in a database system, so operators feel that the system is safer, even though caches tend to be more ephemeral than persistent storage. Unfortunately, it also removes any signal that the operators have for how much load the persistent storage layer can handle. Utilization increases over time without any signal for how close the system is to the boundary of safe operation. One day the cache layer goes down, which is not entirely unexpected, but the sudden influx of requests to the persistent storage quickly overwhelms it. And if there is one thing we know about state layers like persistent storage, it is that performance issues tend to cascade as availability is sacrificed for consistency. Kaboom! The entire system grinds to a halt and the pagers start to sing.
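The arithmetic behind this failure mode is simple enough to sketch. All of the numbers below (a database that can serve 500 requests per second, a 90% cache hit rate, the traffic growth) are invented for illustration, as are the function names.

```python
DB_CAPACITY_RPS = 500  # hypothetical limit of the persistent storage layer


def db_load(total_rps: int, cache_hit_percent: int) -> int:
    """Requests per second that miss the cache and reach the database."""
    return total_rps * (100 - cache_hit_percent) // 100


def headroom_report(traffic_by_month, cache_hit_percent):
    """For each month, return (total traffic, load the database sees,
    whether the database could survive losing the cache)."""
    rows = []
    for total_rps in traffic_by_month:
        seen_by_db = db_load(total_rps, cache_hit_percent)
        without_cache = db_load(total_rps, 0)
        rows.append((total_rps, seen_by_db, without_cache <= DB_CAPACITY_RPS))
    return rows


# Traffic grows tenfold while the cache keeps absorbing 90% of reads.
report = headroom_report([400, 1000, 2000, 4000], cache_hit_percent=90)
for total_rps, seen_by_db, survives_cache_loss in report:
    print(total_rps, seen_by_db, survives_cache_loss)
```

With these illustrative numbers the database never sees more than 400 requests per second against a capacity of 500, so its dashboards look healthy the whole time. Yet the system lost the ability to survive a cache outage as soon as total traffic passed 500, and nothing the operators were watching told them so.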
So there you have it: Our top Seven Myths of Robustness. Of course this only highlights what not to do and doesn’t address what we can do to actually make a system robust. We have plenty to say on this topic, now and in the future. Check out the post on Continuous Verification.
We wish everyone a safe system and may your on-call rotation be uneventful!