More Reports on Near Misses, Please
Editor’s note: This is the fourth part of a series that will focus on each one of The VOID Report 2021 key findings. If you haven’t had a chance to download the report, it’s not too late. You can download the report at thevoid.community/report.
The final pattern we’ll cover is the lack of “near miss” incident reports. We classify a near miss as an incident that the organization noted had no noticeable external or customer impact, but still required intervention from the organization (often to prevent an actual public incident from eventually happening). As of today, there are only nine near-miss reports in the VOID—not even half a percent of the total entries. While this isn’t a surprising aspect of the reports in the VOID, we call it out here as an important opportunity to establish a baseline and track how it changes over time.
While we can’t yet confirm this objectively, we suspect there are a number of reasons why organizations don’t publish near-miss reports:
- Near Misses Can Be Harder to Detect
In the context of complex systems, a near miss can be quite difficult to discover, given that failures often emerge from a particular combination of contributing factors. The path to an incident is a slow drift toward failure, where discerning signal from noise is challenging, and engineers often don’t know when they are nearing a potential safety boundary. These systems also always run in some state of degradation, and chasing every possible faint signal may be more work than a team can take on.
- They’re Not Considered a Formal Incident
If, by nature, a near miss doesn’t have any customer impact, then it may not reach the level of “incident” internally, and hence might not receive the attention and resources—much less the kind of retrospective analysis—that a customer-facing incident would. In this regard, near misses could be organizationally perceived as additional work that shouldn’t be prioritized over more pressing requirements.
- Public Perception and/or Competitive Factors
A near miss report also likely makes lawyers and PR teams nervous. Depending on the organization’s attitude towards failure, transparency, and needing to “control the narrative,” they may be concerned that sharing these types of events could reflect poorly on them or unnecessarily reveal competitive or other proprietary information.
Near Misses Provide Deeper Data
Near miss reports often include much richer information about sociotechnical systems, including breakdowns in communication, dissemination of expertise, and cultural or political forces on those systems. In his coverage of near miss studies (“close calls”) in the aviation industry, Carl Macrae notes that near misses “provide the source materials to interrogate the unknown, test current assumptions, invalidate expectations, become aware of ignorance and undergo experience.” They reveal gaps in current knowledge, mental models, and the range of assumptions that practitioners have about their systems, before those gaps lead to more significant incidents.
“Near misses are generally more worth our time, because they come without the pressure of dealing with post-incident fall-out and blame, and allow a better focus on what happened.” —Fred Hebert, Honeycomb SRE
The lack of pressure inherent in non-customer-facing incidents leaves breathing room that encourages less-rushed thinking and exploration when revisiting the events, while also avoiding pitfalls around root cause and blame (as discussed in the previous post). A near miss is a form of success, after all, so as humans, we’re less prone to hindsight bias—thinking something was predictable after the fact—and defensive attributions—the desire to defend ourselves from the concern that we will be seen as the cause, or the victim, of a mishap—than we might be in the case of a public-facing incident.
In their series of post-mortems about how the GameStop short squeeze frenzy wreaked havoc on Reddit, engineers Courtney Wang, Fran Garcia, and Garrett Hoffman delved into unprecedented detail about the various ways their systems failed under the duress of a deluge of traffic, but also wrote up near misses that helped them better understand the ways in which their systems were successful:
“The r/WallStreetBets events taught us a great deal about how our systems worked, but more importantly they taught us many things about how we as a company work. Even though the focus of these stories revolve heavily around technical triggers, they also highlight how every team at Reddit played an important role in containing and mitigating these incidents.”
Potential Near-Miss Patterns
We plan to dig deeper and more systematically into the information we can gather from near miss reports as we find them. For now we’ll highlight a few patterns and potential areas for future research.
One pattern we’re seeing from reviewing these near miss reports is that the precipitating event is typically an engineer (or other practitioner) noticing that something seems “off,” such as this GitLab report on The Consul Outage That Never Happened:
“The issue came to our attention when a database engineer noticed that one of our database servers in the staging environment could not reconnect to the staging Consul server after the database node was restarted.”
Or this report from PrometheusKube:
“It’s Tuesday morning. We receive a message from a Developer that we are missing some of the logs in production. Strangely enough, it appears that logs only from a single node are missing.”
And this UK governmental investigation into a series of flights that took off with calculated passenger weights that were dramatically different from the actual passenger weights:
“[The flight crew] noticed that there was a discrepancy, with the load sheet showing 1,606 kg less than the flight plan. They noted that the number of children shown on the load sheet was higher than expected, at 65, compared to the 29 which were expected on the flight plan. The commander recalled thinking that the number was high but plausible; he had experienced changing loads on the run-up to the temporary grounding as passengers cancelled and altered trips at short notice.”
These cases all highlight the role that practitioner expertise plays in awareness of system safety boundaries and conditions that can lead to incidents. The subsequent investigations tend to reveal more information about where that expertise stemmed from, which often exposes the gaps in knowledge for other people/teams, along with potential ignorance or assumptions based on those gaps.
“Oops, I Did It Again”
Near-miss reports are also a unique opportunity to reframe the debate around human error. In a near-miss situation, someone committing a typo or “bad” configuration change that doesn’t fully take down a service is a chance to understand how the system can be changed to avoid an actual catastrophe in the future.
In this Cloud Native Computing Foundation talk from 2019, a Spotify engineer details how, by clicking on the wrong browser tab, he managed to delete their entire US production Kubernetes cluster. In the end, the good news was that there was no end-user impact—despite the fact that the team’s efforts to recover from the initial event led to deleting the Asian cluster, which led to a series of efforts that then also took out the US cluster yet again! (This points to another meta-incident pattern we hope to investigate in the future: whether interventions in a current incident unknowingly lead to future incidents.)
As we find more near-miss reports, we plan to look into whether these analyses are less prone to finger-pointing or agentive, blame-focused language. It will also be interesting to compare formal incident reports and near-miss reports within and across organizations.
“Mind the Gap”
We also strongly suspect that near-miss investigations provide analyses that better uncover gaps in knowledge and misalignments in socio-technical systems. These can include things like:
- Expertise held by only a few individuals
- Assumptions about how systems work
- Pattern matching behaviors that lead to “garden path” thinking based on previous incidents or known degraded behaviors
- Lack of knowledge across teams about how systems work
- Production and business pressures (new features, constrained team members, insufficient funding for improvements)
Consider again The Consul Outage That Never Happened:
- Engineer noticed something was “off”: “The issue came to our attention when a database engineer noticed that one of our database servers in the staging environment could not reconnect to the staging Consul server after the database node was restarted.”
- Gap in collective knowledge within the org: “After looking everywhere, and asking everyone on the team, we got the definitive answer that the CA key we created a year ago for this self-signed certificate had been lost.”
- Transitory knowledge, situation normal: “These test certificates were generated for the original proof-of-concept installation for this service and were never intended to be transitioned into production. However...the expired test certificate had not been calling attention to itself.”
- Production pressures, limited people: “...a year ago, our production team was in a very different place. We were small with just four engineers, and three new team members: A manager, director, and engineer, all of whom were still onboarding.”
- Surprises, gaps in knowledge: “We were unsure why the site was even still online because if the clients and services could not connect it was unclear why anything was still working.”
- Precarious conditions, safety boundaries becoming clear(er): “Any interruption of these long-running connections would cause them to revalidate the new connections, resulting in them rejecting all new connections across the fleet.”
- Efforts to fix make it worse: “Every problem uncovered other problems and as we were troubleshooting one of our production Consul servers became unresponsive, disconnected all SSH sessions, and would not allow anyone to reconnect.”
- Further safety boundaries and degraded state uncovered: “Not having quorum in the cluster would have been dangerous when we went to restart all of the nodes, so we left it in that state for the moment.”
- Inherent risk in solving the problem: Multiple solutions are considered, all of which “involve the same risky restart of all services at once.”
- Socio-technical realities: “While there was some time pressure due to the risk of network connections being interrupted, we had to consider the reality of working across timezones as we planned our solution.”
This complex, high-tempo coordinated work was only possible due to the operators’ depth of system expertise (they left the work with the team that had figured out the solution, letting them sleep and then get back to it the next day). And yet, the proposed solution led to more knowledge gaps, more troubleshooting, and additional fixes that were required to proceed. Even after the main coordinated fix went out, there were still a few lingering issues.
Another near miss example comes from Mailchimp, who experienced a puzzling, multi-day internal incident that didn’t affect customers but “prompted a lot of introspection.” The conclusions from their analysis were almost entirely social, cultural, and historical:
MailChimp: Computers are the Easy Part
- Production pressures: “...many of our on-call engineers were already tired from dealing with other issues”
- Not the usual suspect: “...the only change that had landed in production as the incident began was a small change to a logging statement, which couldn’t possibly have caused this type of failure.”
- Cost of coordination: “By the second day, the duration of the incident had attracted a bunch of new responders who hoped to pitch in with resolution.”
- Mental model mismatch driven by cultural and historical factors: “...we realized that this was a failure mode that didn’t really line up with our mental model of how the job system breaks down. This institutional memory is a cultural and historical force, shaping the way we view problems and their solutions.”
- Hard-earned human expertise: “...a couple of engineers who had been observing realized that they’d seen this exact kind of issue some years before.”
- Difficulty adapting in the face of novelty: “...a software organization will tend to develop histories and lore—incidents and failures that have been seen in the past and could likely be the cause of what we’re seeing now. This works perfectly fine as long as problems tend to match similar patterns, but for the small percentage of truly novel issues, an organization can find itself struggling to adapt.”
- Challenges with changing the frame: “Given the large amount of cultural knowledge about how our job runner works and how it fails, we’d been primed to assume that the issue was part of a set of known failure modes.”
These near miss reports tackle internal incidents in ways that few public reports do. Of note, they capture things like knowledge gaps, heuristics and mental models, transitory or siloed knowledge, operating pressures, cognitive processes and biases, and other largely social or organizational factors typically absent in traditional incident reports. We strongly suspect (and yes, plan to study) that organizations that engage in these efforts will have better adaptive capacity to handle disruptions and incidents than their counterparts that don’t.
If you’ve found this interesting or helpful, help us fill the VOID! Submit your incidents at thevoid.community/submit-incident-report. Read more about the VOID and our key findings in The VOID Report 2021.
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.