Answering the Unanswered: The VOID Podcast
Whenever we read a company’s public incident report, many questions ultimately go unanswered. As most people familiar with incident management and analysis know, plenty of information never makes it into the public writeup of a software incident.
So we thought: what if we could ask the people involved in those incidents to take us back to their experience, and answer some of those unanswered questions? That’s the idea behind the VOID podcast.
In our first episode, we chat with two engineers from observability company Honeycomb. Liz Fong-Jones and Fred Hebert take us back to early 2021, when they dealt with a series of outages related to their Kafka architectural migration, culminating in a 12-hour incident, which is an extremely long outage for the company.
We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including:
- Complex sociotechnical systems and the kinds of failures that can happen in them (they’re always surprises)
- Transparency and the benefits of companies sharing these outage reports
- Safety margins, performance envelopes, and the role of expertise in developing a sense for them
- Honeycomb’s incident response philosophy and process
- The cognitive costs of responding to incidents
- What we can (and can’t) learn from incident reports
We hope you enjoy the episode. If you have an incident you’d like to discuss with us on the podcast, give us a shout!
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.