The Chaos Engineering Book

I was hired at Netflix to lead the Traffic Team in early 2015. A few weeks later I was also asked to charter a Chaos Engineering Team. At the time, Chaos Engineering was essentially a program called Chaos Monkey with a few supporting blog posts. I wanted to get a feel for what our engineers thought about the practice, so I asked around: “What is Chaos Engineering?” The response I usually heard was, “Oh, that’s when we break stuff in production.”

As cool as that sounds, I could break stuff in production all day long and not provide any value to Netflix. I wanted to keep my job, not cause pointless havoc. So instead, I sat down with my team and defined Chaos Engineering as a proactive discipline to improve availability. We borrowed heavily from the field of Resilience Engineering and other research outside of the software industry to frame a practice of experimentation that follows in the best traditions of western science. Our definition is still published at http://principlesofchaos.org/. The point of Chaos Engineering isn’t to create chaos; it’s to chart a path of confidence through the chaos.

Working at Netflix was a highlight of my journey with Chaos Engineering. Another is co-authoring and publishing “Chaos Engineering: System Resiliency in Practice” with Nora Jones (Jeli.io). And now I am thrilled to have the opportunity to give away free e-copies of my book.

Why Chaos Engineering is So Important Now

Remote communication is literally saving lives during the pandemic. Never before have so many people relied on digital infrastructure to perform even basic tasks like getting food. The massive migration to remote work and remote education has irrevocably changed the course of societal structures, norms, and communication.

Work and school for hundreds of millions of people moved online in a matter of weeks. This is a monumental change to the most complex sociotechnical system we will ever be a part of. We don’t have access to a control group in this experiment. We only have one shot to get this right. The stakes are high.

With high stakes comes the opportunity for high value. The importance of safety and reliability in the digital systems that we are suddenly so dependent on has never been more obvious. We have the opportunity and the responsibility to innovate in the areas of availability and security, to make these systems better in ways that matter to people.

The old methods won’t cut it. They aren’t bad, but they aren’t sufficient. Incident response management, alerting, metrics/logging, disaster recovery—all great, but also all reactive. They focus on time-to-detect and time-to-remediate. We need proactive methods. TDD, pair programming, peer code reviews, syntax scanning, QA—also great, but they won’t move the needle on availability or security in a complex system. You can’t expect a human to provide safety guarantees on something (a complex distributed system) that by definition is beyond the capability of a human to mentally model.

Chaos Engineering is a proactive approach to improving the safety properties of complex systems. We have an imperative to lean into these new, innovative approaches to help us cope with the increasing complexity and stress that our organizations are under. If you feel like your organization is facing unprecedented demands and requirements from a reliability perspective, you are in good company: most of us operating systems at scale are in that same position. With Chaos Engineering, you have an opportunity to meet those demands and navigate that complexity.

In the Book

This book explains where Chaos Engineering came from and provides some mental models to challenge the current mainstream thinking on system reliability. We then provide chapters contributed by authors from Slack, Google, Microsoft, LinkedIn, and Capital One, so that you can hear in their own words how people responsible for critical systems at scale are embracing Chaos Engineering to meet the challenges of the day. We also explore a bit outside of the typical boundaries of distributed software engineering to take a look at the future of the practice as well as its impact in manufacturing, autonomous vehicles, human systems, cyber security, and Continuous Verification.

We Want You to Have the Book… for Free

Nora and I set out to write the most comprehensive, practical guide to Chaos Engineering. We even dedicated an entire chapter to establishing ROI, so that you can see the return Chaos Engineering delivers.

Now Verica is sponsoring this book so that we can send you a digital copy for free. As a company, we believe the concepts explored within have the potential to change, for the better, how people build, operate, and maintain systems at scale. To get your free copy, go to verica.io/book.