VOID 2022 Report Now Available
One year and almost 10,000 incident reports later, the 2022 VOID Report is here! After scrutinizing an entire year’s worth of incidents, one thing is crystal clear: resilience saves time. Taking the time to understand how to better respond when something green turns red (learning from the people, the processes, and the systems) will make the next incident smoother. Because, yes, there will always be a next incident.
It has been an incredible year, and we wouldn’t be able to release a report with this volume of in-depth data without our community. So before we jump into the key findings, we want to express our genuine gratitude to a handful of people and companies who made this possible:
- The Learning From Incidents community, which inspired and championed this work from the very beginning.
- Sponsors Jeli, Indeed, Scotiabank, Stanza, and Adaptive Capacity Labs, who all share our vision of increasing cross-industry dialogue, transparency, and sharing that will help companies better track, prepare for, and mitigate inevitable disruptions.
- Friend, mentor, and resilience raconteur Dr. Richard Cook, who will be greatly missed.
“Incidents are untyped pointers to areas of interest in their system. The ‘untyped’ part is critical: the work of decoding what has happened is entirely left to the analyst.” (Dr. Richard Cook)
No company is immune from incidents.
Incidents happen in organizations of all sizes, from startups to the Fortune 10. Software is mission-critical in every industry, including banking, travel, agriculture, commerce, and more.
The key findings include:
- Length isn’t as cut and dried as it appears: there are many insightful metrics to measure in an incident. The duration of incidents conveys little about the incidents themselves, in part because it can be very tricky to pin down when an incident starts or stops.
- SREs and others in similar roles should retire MTTR as a key metric. This year’s report confirms that MTTR isn’t a viable metric for the reliability of complex software systems for a myriad of reasons, particularly because averages of duration data lie.
- Common assumptions around incident duration and severity are debunked. Incident duration and severity are not related, and we have the in-depth data to prove it.
- Organizations are moving away from shortsighted approaches like RCA. Root Cause Analysis appears to be on the decline in orgs of all sizes, as they move toward more meaningful metrics and analysis.
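To make the point about averages concrete, here is a minimal Python sketch. The durations are invented for illustration and are not taken from the VOID dataset; the only assumption is that most incidents are short and a few run very long, and the names `DURATIONS_MIN` and `summarize` are hypothetical.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes. These values are invented
# for illustration and are NOT taken from the VOID dataset.
DURATIONS_MIN = [12, 18, 25, 31, 40, 47, 55, 62, 90, 1440]


def summarize(durations):
    """Return the mean, the median, and the share of incidents shorter than the mean."""
    avg = mean(durations)
    below = sum(1 for d in durations if d < avg)
    return {
        "mean": avg,
        "median": median(durations),
        "share_below_mean": below / len(durations),
    }


if __name__ == "__main__":
    summary = summarize(DURATIONS_MIN)
    print(f"mean duration (the 'MTTR'): {summary['mean']:.1f} min")
    print(f"median duration:            {summary['median']:.1f} min")
    print(f"share shorter than mean:    {summary['share_below_mean']:.0%}")

    # Dropping the single day-long incident changes the mean drastically.
    print(f"mean without the outlier:   {mean(DURATIONS_MIN[:-1]):.1f} min")
```

With these made-up numbers the mean is 182 minutes while the median is 43.5 minutes, and nine of the ten incidents are shorter than the mean. A single long incident dominates the average, which is one way an MTTR figure can misrepresent what a typical incident looks like.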
In the past, published software incident reports have been scattered across the Internet: it’s often difficult to link directly to them, or they are sequestered in corners of company websites.
The VOID addresses this problem by collecting these reports in one place to help improve the software running key areas such as transportation, infrastructure, power grids, healthcare devices, voting systems, autonomous vehicles, and many other critical societal functions.
We can’t wait to have the community dig into these results, share with their colleagues, and learn and grow together as we build a discipline that is (counterintuitively) strengthened through failures.
If you are interested in bringing a culture of learning from incidents and best practices into your organization, schedule time with the VOID team and see what else you can do with a VOID membership, including quarterly learning labs, direct access to experts, and more. And, as always, we’d love your help in making the VOID more comprehensive. You can submit any incidents that aren’t included in the database with this short form.
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.