Root Cause Is for Plants, Not Software

By Courtney Nash | November 17, 2021

Roughly a quarter of the VOID incident reports (26%) either identify a specific “root cause” or explicitly claim to have conducted a Root Cause Analysis (RCA). We consider these data preliminary, however, given the incomplete nature of the overall dataset. As we continue to add more reports, we’ll track this and see if it changes. 

We’re specifically looking into RCA because, like MTTR, it is appealing in its decisiveness and apparent simplicity, but it too is misleading.

MTTR is a Misleading Metric—Now What?

Software organizations tend to value measurement, iteration, and improvement based on data. These are great things for an organization to focus on; however, this has led to an industry practice of calculating and tracking Mean Time to Resolve, or MTTR.  While it’s understandable to want to have a clear metric for tracking incident resolution, MTTR is problematic for a couple of reasons. 

Answering the Unanswered: The VOID Podcast

Whenever we read a company’s public incident report, there are so many questions that ultimately go unanswered. As most people familiar with incident management and analysis know, there’s plenty of information that doesn’t make it into the public writeups of a software incident.

So we thought: what if we could ask the people involved in those incidents to take us back to their experience, and answer some of those unanswered questions? That’s the idea behind the VOID podcast. 

Experts Win the Day (most of the time)

When it comes to outages, we only seem to hear the bad news. In reality, the people who run these systems resolve these issues fairly quickly the majority of the time. They know this, but we haven’t had the data to show this, until now. This finding, that over half of software-related incidents are resolved in under two hours, comes from the Verica Open Incident Database (VOID). The VOID makes public software-related incident reports available to everyone, raising awareness and increasing understanding of software-based failures in order to make the internet a more resilient and safe place.  

Announcing the VOID

“The potential for catastrophic outcome is a hallmark of complex systems. It is impossible to eliminate the potential for such catastrophic failure; the potential for such failure is always present by the system’s own nature.” (Richard Cook, How Complex Systems Fail) Today, we’re announcing a project that represents a small step on a long road […]

Chaos Engineering and a Pandemic: How we ended up with The Chaos Community Broadcast

Our vision was to interview leaders in the Chaos Engineering field, as awkwardly as possible, in a cross between the styles of Between Two Ferns and The Office. We think we’ve succeeded.

A Day in the Life: Tiffany Knudtson

I’ve been friends with James [Wickett] and his wife for a while and he asked me if I would be interested in the role. I worked with James at one of my previous jobs around 10 years ago, so I knew that I liked working with him and that he was a really good guy. I was able to trust that what he said about Verica was legit. He said the people here were smart and caring, and that it wasn’t a culture that beat you down. There are a lot of those types of cultures in workplaces today, but Verica is a place where you can feel like you’re being built up. I really like that. It’s a great place to work. I was saying the other day that even if I were to win the lottery, I’d still want to work here.

The Advanced Principles of Chaos Engineering

Chaos Engineering is grounded in empiricism, experimentation over testing, and verification over validation. But not all experimentation is equally valuable. The principles of Chaos Engineering extend to a “gold standard” captured in a set of advanced principles.

What Chaos Engineering Is (and Isn’t)

The Birth of Chaos: The beginning of Chaos Engineering goes back to 2008, when Netflix moved from the datacenter to the cloud. The move didn’t go as planned. The thinking at the time was that the datacenter locked them into an architecture of single points of failure, like large databases, and vertically scaled components. Moving […]

A Day in the Life: Karthik Gaekwad

Karthik Gaekwad | Austin, Texas | At Verica for: 8 months. Ed Note: This is an ongoing series about what a day in the life of various Verica folks looks like. In this post, we chat with Karthik, one of our engineers on the Kubernetes module team. In addition to being Verica’s resident Cloud Native expert, Karthik is an […]

Four Prerequisites for Chaos Engineering

Ed note: This post presumes you have some familiarity with Chaos Engineering, and are considering whether you can start experimenting with it at your organization. If you’re not familiar with Chaos Engineering, here’s a great post to get you up to speed. Chaos Engineering is often characterized as “breaking things in production,” which lends it […]

A Day in the Life: Randall Hansen

Randall Hansen | VP, User Experience | “I live in the burning wreckage of Portland, Oregon” | At Verica for: 2 years. Ed Note: This is an ongoing series about what a day in the life of various Verica folks looks like. In this post, we chat with Randall Hansen, VP of User Experience. Randall’s superpower is seeing the world […]

A Day in the Life: Chantell Nichols

This is an ongoing series about what a day in the life of various Verica folks looks like. In this post, we chat with Chantell Nichols, a Senior Software Engineer on the Platform team. Chantell is a dedicated teammate with an attitude that makes everyone believe anything is possible. She has an insatiable thirst for learning that motivates and inspires everyone around her.

Security Chaos Engineering: How to Security Differently

“The growth of complexity in society has got ahead of our understanding of how complex systems work and fail. And when such complexity fails, we still apply simple, linear, componential ideas as if that will help us understand what went wrong.” (Sidney Dekker, Drift Into Failure) Rapid technological innovation has presented businesses unique opportunities to […]

From Catastrophe to Chaos in Production

Ed note: This is the second in a multi-part series about the free Security Chaos Engineering report from O’Reilly. This guest post is from Kelly Shortridge, co-author of the report, and originally was posted on the Capsule8 blog. It is re-posted here with her permission. Production is the remunerative reliquary where we can realize value […]

Announcing the Security Chaos Engineering Report

Ed note: This is the first in a multi-part series about the free O’Reilly report co-authored by Aaron Rinehart and Kelly Shortridge. Information security is broken. Our users, our customers, our world, are entrusting us with more and more of their lives and we are failing to keep that trust. Year after year the same […]

Downtime: Bad for Computers, Essential for Humans

“Each person deserves a day away in which no problems are confronted, no solutions searched for. Each of us needs to withdraw from the cares which will not withdraw from us.” (Maya Angelou) When I joined Verica I was told that the policy on vacation time was unlimited PTO. No sick days to be concerned […]

The Chaos Engineering Book

I was hired at Netflix to lead the Traffic Team in early 2015. A few weeks later I was also asked to charter a Chaos Engineering Team. At the time, Chaos Engineering was essentially a program called Chaos Monkey with a few supporting blog posts. I wanted to get a feel for what our engineers […]

Security Chaos Engineering on Enterprise Security Weekly

Co-Founder and CEO Casey Rosenthal and Co-Founder and CTO Aaron Rinehart of Verica join Enterprise Security Weekly to talk about Security Chaos Engineering. In this episode, Casey and Aaron discuss how companies of all sizes in different verticals are adopting Chaos Engineering and how Continuous Verification is now gaining adoption. The discussion digs deep into […]

Music in Resilience: The Practice of Practice

You have probably heard the phrase “practice makes perfect”, but let’s take that thought up and out. It is really the experience of practice that is more interesting! Consider a jazz band you’ve heard or seen live, or really any type of music where improvisation plays an important role. Great bands work together effortlessly, and when things go […]

Security Chaos Engineering down in Austin, TX

The Verica crew had a fun time with the fine folks down in Austin, TX at the city’s first DevSecOps Days event. Going to Austin during the holiday season never results in a White Christmas, but this year Austin delivered on some great security and devops talks that should make you cheery all year long.  […]

Safety and Security at Speed

For the 2019 OWASP Global AppSec DC event, which is the largest application security conference on the planet, I brought the MEASURE framework in the form of a talk. One of the goals of the talk was to bring together the worlds of Safety and Security. I believe that Safety fits in the MEASURE framework for […]

The MEASURE of DevSecOps

The start of DevOps was simple enough: get some developers and operations folks together and build something. With together being the keyword. The concept of DevOps came from this notion that operations needed to take part in the Agile movement. Many articles chronicle the history of how Patrick Debois and Andrew Clay Shafer coined the term and […]

It’s a Summer Verication!

You may know us at Verica from our ground-breaking work in the fields of chaos engineering and continuous verification, both from our history at Netflix and UnitedHealth Group and from the articles we have on this very blog, but you probably don’t know the team behind the company. Well, this is a little insight into […]

Verification vs Validation

Many people use the words Verification and Validation interchangeably, which risks the ability to focus on system-level behaviors that correspond to business value. We prefer to use definitions inspired by the field of Operations Research Management. Verification is finding congruence between what you expect from a system and the actual output. Validation is finding congruence between an […]

Top Seven Myths of Robust Systems

One of my favorite topics of discussion within the domain of Availability is mythology. Not dragons and unicorns, which would be undeniably cool, but myths in the sense of made-up stories we tell ourselves to explain things that we don’t understand. There are many things that we as an industry tell ourselves about the nature […]

Inhumanity of Root Cause Analysis

Root Cause Analysis (RCA), a common practice throughout the software industry, does not provide any value in preventing future incidents in complex software systems. Instead, it reinforces hierarchical structures that confer blame, and this inhibits learning, creativity, and psychological safety. In short, RCA is an inhumane practice. Fortunately, there are healthy alternatives to RCA. I’ll […]

Continuous Verification

Three years from now, if you are pushing code into a serious production environment, it will go through a Continuous Verification (CV) pipeline. The software industry’s transition into complex systems is accelerating. The humans designing, building, and operating these complex systems are no longer capable of understanding how all of the pieces fit together. It’s […]
