Announcing the Security Chaos Engineering Report

Ed note: This is the first in a multi-part series about the free O’Reilly report co-authored by Aaron Rinehart and Kelly Shortridge. 

Information security is broken. Our users, our customers, our world are entrusting us with more and more of their lives, and we are failing to keep that trust. Year after year, the same kinds of attacks succeed, and their impact grows. Meanwhile, the security industry keeps chasing shiny new solutions to yesterday’s problems.

Hope isn’t a strategy. Likewise, perfection isn’t a plan. The systems we are responsible for are failing as a normal function of how they operate, whether we like it or not, whether we see it or not. Security Chaos Engineering (SCE) is about increasing confidence that our security mechanisms perform under the conditions for which we designed them. Through continuous security experimentation, we become better prepared as an organization and reduce the likelihood of being caught off guard by unforeseen disruptions. These practices prepare us as professionals, our teams, and the organizations we represent to be effective and resilient when faced with security unknowns.

The practice of building software has advanced to the point where the systems we build are impossible for any single person to understand or hold in their head. Our systems are now vastly distributed and operationally ephemeral. Transformational technology shifts such as cloud computing, microservices, and continuous delivery (CD) have each brought new advances in customer value, but they have in turn produced a new series of challenges. Primary among those challenges is our inability to maintain an accurate understanding of our own systems.

This is a problem I discovered firsthand while serving as the Chief Enterprise Security Architect at UnitedHealth Group. I always felt I had an imperfect understanding of how our systems were designed or what they really did. A data architect or a solutions architect would come to me with different diagrams of the same system, each representing its author’s mental model of how the system worked. And as a security architect, I was trying to think about, “How do I design security into this? How do I give them the guidance they seek to do the best they can at protecting the information the system transacts?” As a result of this conflict, I would end up making the best of the limited information I had to ensure the right security guidance was given. Without an accurate depiction of the system to begin with, I was never sure whether the security controls I recommended were implemented correctly or whether they were as effective as I had designed them to be. I needed a way to cut through all the noise and ask the system the question directly.

This led to developing a tool (that we eventually open sourced) called ChaoSlingr. By generating experiments to test hypotheses about how our security systems work, I was able to prove that many of our security controls were not working as designed or expected. By intentionally introducing a failure mode or other event, our team discovered how well instrumented, observable, and measurable our systems truly were. We were able to validate critical security assumptions, assess strengths and weaknesses, then move to stabilize the former and mitigate the latter. We even found benefits for our incident response teams, helping them better model and understand the outputs of the tools they used to manage incidents. The value this tool provided us was immense, and it was the kernel of Security Chaos Engineering.
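To make the idea concrete, here is a minimal, hypothetical sketch of the kind of experiment a ChaoSlingr-style tool automates: open an unauthorized port on an AWS security group, verify that detection fires, then roll the change back. The resource IDs and the SIEM check below are placeholders for illustration, not ChaoSlingr’s actual API.

```python
# Hypothetical security chaos experiment: introduce a misconfigured port and
# ask the system directly whether our detection and response actually work.
import time
import boto3

ec2 = boto3.client("ec2")

SECURITY_GROUP_ID = "sg-0123456789abcdef0"  # placeholder target group
UNAUTHORIZED_PORT = 2222                     # a port policy says must stay closed

RULE = [{
    "IpProtocol": "tcp",
    "FromPort": UNAUTHORIZED_PORT,
    "ToPort": UNAUTHORIZED_PORT,
    "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
}]

def inject_misconfigured_port():
    """Introduce the failure condition: allow inbound traffic on a forbidden port."""
    ec2.authorize_security_group_ingress(GroupId=SECURITY_GROUP_ID, IpPermissions=RULE)

def detection_fired(timeout_seconds=300):
    """Placeholder: poll your SIEM or alerting pipeline for the expected alert.
    In a real experiment this would query whatever monitoring you rely on."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        # e.g. if siem.has_alert("unauthorized_port_open", SECURITY_GROUP_ID): return True
        time.sleep(15)
    return False

def rollback():
    """Always restore steady state, whether or not the hypothesis held."""
    ec2.revoke_security_group_ingress(GroupId=SECURITY_GROUP_ID, IpPermissions=RULE)

if __name__ == "__main__":
    # Hypothesis: if an unauthorized port is opened, our controls detect it within 5 minutes.
    inject_misconfigured_port()
    try:
        print("Hypothesis validated" if detection_fired() else "Hypothesis refuted: no alert fired")
    finally:
        rollback()
```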

Unbeknownst to me at the time, I was developing the first-ever Security Chaos Engineering program. After I began to socialize the value of what we had learned with ChaoSlingr, I ran across a keynote at Black Hat that my co-author Kelly Shortridge gave with Dr. Nicole Forsgren on the value of adopting chaos engineering within cybersecurity. I quickly realized I was not alone on this journey. Kelly and I decided to join forces and write a book, which is forthcoming! In the meantime, the report is a shot across the bow, a dismantling of many common misperceptions about security in complex systems. We reassess the first principles underlying organizational defense and pull out the failed assumptions by their roots. In their place, we are planting the seeds of a new resistance, one that favors alignment with organizational needs and seeks proactive, adaptive learning over reactive patching.

For people who are compelled by this new approach but don’t know where to start, I recommend starting small. A well-conceived experiment that provides data for engineers is far more persuasive than a request for time or money to “try something new.” One great place to start is by running a game day where you conduct some security chaos experiments. Game days immediately generate valuable results by injecting failure during a live-fire exercise. The intention is to develop a series of hypotheses about your security components, plan an experiment, and execute it to either validate or refute those hypotheses. Security chaos game days introduce controlled, safe, and observed experiments that let engineers see how well their security responds to real-world conditions. During these sessions, teams exercise their recovery procedures, test the effectiveness of their security controls, and sharpen resolution protocols such as runbooks.
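One lightweight way to run a game day is to capture each hypothesis, injection, and verification step as code, so the exercise is explicit and repeatable. The structure below is an illustrative sketch rather than a tool from the report; the experiment names and check functions in the commented example are hypothetical.

```python
# A simple way to structure game-day experiments: every experiment states its
# hypothesis, injects a failure, verifies the outcome, and restores steady state.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SecurityChaosExperiment:
    hypothesis: str                 # what we believe the system will do
    inject: Callable[[], None]      # introduce the failure condition
    verify: Callable[[], bool]      # did the security control behave as expected?
    rollback: Callable[[], None]    # restore steady state afterward

def run_game_day(experiments):
    """Execute each experiment in turn, always rolling back the injected failure."""
    for exp in experiments:
        print(f"Hypothesis: {exp.hypothesis}")
        exp.inject()
        try:
            print("  -> validated" if exp.verify() else "  -> refuted")
        finally:
            exp.rollback()

# Example (hypothetical) game-day agenda:
# run_game_day([
#     SecurityChaosExperiment(
#         hypothesis="Disabling a WAF rule triggers a page within 10 minutes",
#         inject=disable_waf_rule, verify=pager_alert_received, rollback=enable_waf_rule),
#     SecurityChaosExperiment(
#         hypothesis="An expired TLS cert on the internal API is caught by monitoring",
#         inject=swap_in_expired_cert, verify=cert_alert_fired, rollback=restore_cert),
# ])
```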

The report also features case studies of how Capital One and Cardinal Health implemented SCE programs (in highly regulated industries, to boot), and it highlights a couple of tools you can use to model experiments on your team.

SCE—and Chaos Engineering itself—is still an emerging and advancing set of practices, techniques, and toolsets. It’s true that the shift in mindset is going to be tough for many organizations initially, especially for those clinging to a traditional security ethos. But if we begin to challenge our assumptions, cultivate collaboration rather than silos, and learn through experimentation, we are certain to make a profound impact on an industry that so desperately needs it.

In the next post in this series, Kelly takes us from catastrophe to chaos in production, detailing three possible security failures from the report and providing specific chaos experiments you can conduct to better understand how your systems will respond in the face of such failures.

Aaron Rinehart
Verica Co-founder & CTO

Aaron Rinehart has been expanding the possibilities of chaos engineering by applying it to other safety-critical portions of the IT domain, notably cybersecurity. He began pioneering the application of chaos engineering to security during his tenure as the Chief Security Architect at the largest private healthcare company in the world, UnitedHealth Group (UHG). While at UHG, Rinehart released ChaoSlingr, one of the first open-source tools focused on using chaos engineering in cybersecurity to build more resilient systems. Rinehart recently founded a chaos engineering startup called Verica with Casey Rosenthal, formerly of Netflix, and is a frequent author, consultant, and speaker in the space.