From Catastrophe to Chaos in Production

Ed note: This is the second in a multi-part series about the free Security Chaos Engineering report from O’Reilly. This guest post is from Kelly Shortridge, co-author of the report, and originally was posted on the Capsule8 blog. It is re-posted here with her permission.

Production is the remunerative reliquary where we can realize value from all of this software stuff we wrangle day in and day out. As has been foretold from crystal balls through to magic eight balls, the deployment and operation of services in production are increasingly the new engine of business growth, and few industries are spared from this prophetic paradigm spun by the Fates. A rather obvious conclusion is that we should probably keep these systems safe—that our endeavors should maximize confidence that these systems will operate as intended.

“Keep prod safe?” you say, incredulous that I would impart such trite advice. Rest easy, for the discussion of how we should approach keeping prod safe turns from trite to tart! Simply put, the best way to keep production systems safe is by harnessing failure as a tool rather than a stain to wash out in a Lady Macbethian fashion. If we seek to recover from failure as swiftly and elegantly as possible so that service disruption is minimized, we must wield failure as a magical staff with which we can illuminate essential knowledge of our systems.

Thus is the way of Security Chaos Engineering. To whet your appetite for implementing chaos magic in your own organization, this post will explore how failure can manifest in production systems and provide examples of security chaos engineering experiments that can help you gain the confidence in production that any venerable coven of software sorcerers deserves.

Failure in Production

The natural response to fear of failure in production is an attempt to prevent any and all potential issues from manifesting in prod. In the realm of security, this often looks like removing vulnerabilities before software is deployed to prod. While that can certainly be useful, it’s unrealistic to assume that all problems will be caught and remediated ahead of deployment—especially in components over which you lack control, like container runtimes, orchestrators, or the underlying kernel itself. Deploying to production will always involve some level of ambiguity, despite the preference of human brains to the contrary.

Another wrinkle in the fight against failure in prod is that said failure can manifest in a mess of multiplicitous manners. Production systems can consist of a diverse range of infrastructure, including private or public clouds, VPSs, VPCs, VMs, containers, serverless functions, and, one distant day, perhaps even computerless compute!

As a result, production systems are justifiably classified as complex systems full of interrelated components that affect each other. Failure, therefore, is like a flame catching on one strand of an interwoven tapestry: it can spread to the neighboring strands and engulf the whole thing. That is, there is a strong risk of contagion in prod, which can transform an event in one node into a poison that seeps into other components and severely disrupts your services.

To complicate matters further, there is a dizzying array of activity that can jeopardize production operations. We can simplify this activity into two high-level types of unwanted activity: deliberately malicious (attackers) and accidentally careless (devs). For instance, attackers can spawn a remote interactive shell, disable SELinux, and exploit a kernel vulnerability to get root access. Or, developers can download production data or accidentally delete log files. Oftentimes, these two types of activity can look pretty similar. Attackers and developers can both be tempted to attach a debugger to a production system, which can facilitate privilege escalation or exposure of sensitive information. 
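Because Linux exposes process state through the proc filesystem, even debugger attachment leaves a file-shaped trail. Below is a minimal sketch, not any product’s detection logic, that reads /proc/&lt;pid&gt;/status and reports a nonzero TracerPid, which indicates something (gdb, strace, or an attacker’s tooling) is ptrace-attached to the process; the target PID and the print-based alerting are placeholder assumptions.

```python
# Minimal sketch (assumptions: Linux host, a target PID passed on the command
# line, print-based "alerting"). A nonzero TracerPid in /proc/<pid>/status
# means some process is ptrace-attached, e.g. a debugger or strace.
import sys

def tracer_pid(pid: int) -> int:
    """Return the PID of any process tracing `pid`, or 0 if none."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("TracerPid:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    target = int(sys.argv[1])
    tracer = tracer_pid(target)
    if tracer:
        print(f"process {target} is being traced by pid {tracer}")
    else:
        print(f"no tracer attached to process {target}")
```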

Finally, when we think about production failures, we would be remiss not to note that most production infrastructure runs on Linux, where everything is a file. This means that failure in production often bubbles up like rancid alphabet soup (which I’ll deem “Spaghetti-uh-ohs”) from unwanted file-related activity.

To make these considerations more concrete, let’s explore three examples of failure in production (of which countless more are available in the Security Chaos Engineering report):

  1. Log files are deleted or tampered with. To be honest, if this occurs, your ops is likely screwed. Log pipelines are critical for operational monitoring and workflows, so telling your local ops or SRE that this failure has happened likely counts as a jump scare. (A rough detection sketch for this kind of file failure follows this list.)
  2. Boot files, root certificate stores, or SSH keys are changed. Well, this presents quite the pickle! Modification of these critical assets is akin to defenestration of your system stability—even aside from the potential stability snafus arising from an attacker maintaining access and rummaging around the system.
  3. Resource limits are disabled. This activity is highly sus and doubtless disastrous. Whether due to a script stuck in an infinite loop or a cryptominer gorging itself on compute like a mosquito on blood, the dissipation of your resource limits can lead to overhead overload in your system that results in service disruption.
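As a rough illustration of the first two failure examples, the sketch below polls a handful of critical paths and flags deletion or modification. The specific paths and polling interval are placeholder assumptions for a hypothetical host; a real deployment would more likely lean on auditd, inotify, or a dedicated file-integrity monitoring tool than on a loop like this.

```python
# Minimal sketch (not a product feature): poll a few critical paths and flag
# deletion or content changes. The watched paths and interval are illustrative
# assumptions; run with enough privileges to read the paths you care about.
import hashlib
import time
from pathlib import Path

WATCHED = [
    Path("/var/log/auth.log"),        # log pipeline integrity
    Path("/boot/grub/grub.cfg"),      # boot configuration
    Path("/etc/ssh/sshd_config"),     # SSH server configuration
]

def fingerprint(path: Path) -> str | None:
    """Return a SHA-256 digest of the file, or None if it is missing."""
    try:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    except FileNotFoundError:
        return None

def watch(interval_seconds: int = 30) -> None:
    # Record a baseline, then alert whenever a file disappears or changes.
    baseline = {path: fingerprint(path) for path in WATCHED}
    while True:
        time.sleep(interval_seconds)
        for path, known in baseline.items():
            current = fingerprint(path)
            if current is None and known is not None:
                print(f"ALERT: {path} was deleted")
            elif current != known:
                print(f"ALERT: {path} changed on disk")
            baseline[path] = current  # only alert once per change

if __name__ == "__main__":
    watch()
```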

Now, how do we constructively cope with this complexity? By replacing catastrophe with chaos.

Security Chaos Engineering in Production

The TL;DR on Security Chaos Engineering (SCE) is that it seeks to leverage failure as a learning opportunity and as a chance to build knowledge of systems-level dynamics towards continuously improving the resilience of your systems. The first key benefit of SCE is that you can generate evidence from experiments which will lay the foundation for a greater understanding of how your systems behave and respond to failure.

Leveraging that understanding, a second benefit is that SCE helps build muscle memory around responding to failure, so incidents become problems with known processes for solving them. SCE experiments not only allow teams to continuously practice recovering from failure, but also encourage software engineering and security teams to practice together—so when a real incident transpires, an accomplished alliance can rapidly respond to it.

Another core benefit of SCE for production systems is its utility for grokking complex systems. Identifying relationships between components in production helps reduce the potential for contagion during incidents, which helps you recover and restore service faster. Performing security chaos experiments can facilitate this discovery, since simulating failure in one resource can reveal which of the resources connected to it are also impacted.

Naturally, if you want to proactively understand how your prod systems respond to certain failure conditions, you really have no choice but to conduct chaos tests in production. The reality is that if you don’t run your tests in prod, you won’t be as prepared when inevitable incidents occur. Of course, it’s pretty reasonable that your organization might be reluctant to start SCE experiments in prod, so staging environments can serve as the proving grounds for gaining an approximation of how your systems respond to failure. This should be viewed as a stop-gap, however; it’s important to have a plan in place to migrate your SCE tests to prod (the SCE report enumerates those steps).

“Enough of philosophy,” you cry out, “show me some chaos summoning spells!” A reasonable request, dear reader. Let’s now explore three examples of SCE experiments in production and what important questions they answer (again, this is just a sample to get you thirsty for the full list available in the free SCE report).

  1. Create and execute a new file in a container (a sketch of this experiment follows the list). Is your container actually immutable? How does your container respond to new file execution? Does it affect your cluster? How far does the event propagate? Are you containing impact? How quickly is the activity detected?
  2. Inject program crashes. Is your node restarting by itself after a program crash? How quickly are you able to redeploy after a program crash? Does it affect service delivery? Are you able to detect a program crash in prod?
  3. Time travel on a host—changing the host’s time forward or backward (my fav test tbh). How do your systems handle expired or not-yet-valid certificates and licenses? Are your systems even checking certificates? Do time-related issues disrupt or otherwise bork service delivery? Are you relying on external NTP? Are time-related issues (e.g. across logging, certificates, SLA tracking, etc.) generating alerts?
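To make experiment #1 concrete, here is one hedged way you might drive it from outside the container; it is a sketch under assumptions, not the report’s prescribed tooling. It uses the docker CLI via subprocess to shell into a hypothetical container named payments-api, drop a tiny executable under /tmp, and run it. Whether the write even succeeds already tells you something about how immutable the container really is, and whatever alerts fire (or don’t) tell you about detection.

```python
# Minimal sketch of experiment #1: drop and execute a new file inside a running
# container, then watch what your detection and response pipeline does with it.
# The container name, payload, and path are illustrative assumptions; this uses
# the docker CLI via subprocess rather than any particular chaos tool.
import subprocess

CONTAINER = "payments-api"  # hypothetical target container
PAYLOAD = (
    "printf '#!/bin/sh\\necho sce-marker\\n' > /tmp/sce_test"
    " && chmod +x /tmp/sce_test && /tmp/sce_test"
)

def run_in_container(container: str, command: str) -> subprocess.CompletedProcess:
    """Run a shell command inside the container via `docker exec`."""
    return subprocess.run(
        ["docker", "exec", container, "sh", "-c", command],
        capture_output=True,
        text=True,
    )

if __name__ == "__main__":
    result = run_in_container(CONTAINER, PAYLOAD)
    if result.returncode != 0:
        # A read-only root filesystem (an "immutable" container) should land here.
        print(f"write/execute blocked or failed: {result.stderr.strip()}")
    else:
        print("new file was created and executed; check how quickly your alerts fire")
```

If /tmp happens to be mounted as a writable tmpfs, point the payload at a path on the container’s root filesystem instead so the experiment actually exercises the read-only guarantee you think you have.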

Parting Thoughts

As in life, failure in production is inevitable…so you might as well turn it into a strength by learning from it early and often. Experimentation with chaos magic by injecting failure into prod systems uncovers new knowledge about your systems and builds muscle memory around incident response. This is why Security Chaos Engineering is the optimal path forward to build confidence in the safety of our production systems—the moneymakers on which our organizations desperately depend.

Kelly Shortridge

Kelly Shortridge is VP of Product Management and Product Strategy at Capsule8. In her spare time, she researches applications of behavioral economics to information security, on which she’s spoken at conferences internationally, including Black Hat, AusCERT, Hacktivity, Troopers, and ZeroNights. Most recently, Kelly was the Product Manager for Analytics at SecurityScorecard. Previously, Kelly was the Product Manager for cross-platform detection capabilities at BAE Systems Applied Intelligence as well as co-founder and COO of IperLane, which was acquired. Prior to IperLane, Kelly was an investment banking analyst at Teneo Capital covering the data security and analytics sectors.