Continuous Verification

Three years from now, if you are pushing code into a serious production environment, it will go through a Continuous Verification (CV) pipeline.

The software industry’s transition into complex systems is accelerating. The humans designing, building, and operating these complex systems are no longer capable of understanding how all of the pieces fit together. It’s inevitable, and it’s also okay. Instead of fighting the complexity, we are learning to navigate it.

There are two crucial requirements for navigating complex software systems:

  1. Have the capability to quickly and safely reverse design decisions that make it into the production environment. This is another way of saying fail fast and recover quickly.
  2. Align teams around the output of the system (verification) rather than the internal details (validation) of architecture.

Continuous Verification encourages both of these requirements in a way that proactively educates engineers about the systems they operate. It is emerging as a crucial practice for navigating complex software systems.

The crucial conversations around this nascent discipline are well underway, and the progression to CI/CD/CV is being worked out in theory and in practice. Here is an excerpt from the upcoming book Chaos Engineering: System Resiliency in Practice, to be published by O’Reilly Media in late 2019.

“When a gap of expectations exists between two or more engineers separately writing code, the interaction of that code may produce unexpected and undesirable results. The quicker that can be caught, the more likely it is that the next code written will have fewer gaps. Conversely, if that gap is not caught early on, then the next code written will likely diverge even further, multiplying the opportunities for undesirable outcomes.

“One of the most efficient methods for uncovering that gap is to put the code together and run it. Continuous Integration (CI) was promoted heavily as part of XP methodology as a way to achieve this. CI is now a common industry norm. CI pipelines encourage the development of integration tests which specifically test the functionality of the combined features of the code written by separate developers or teams.

“With each edit to code published to a common repository, a CI pipeline will compile the new amalgamation and run the integration test suite to confirm that no breaking changes were introduced. This feedback loop promotes Reversibility, one of the pillars of the EPC model discussed in the chapter on Complex Systems.

“The practice of Continuous Delivery (CD) builds on the success of CI by automating the steps of preparing code and deploying it to an environment. CD tools allow engineers to choose a build that passed the CI stage and promote that through the pipeline to run in production. This gives developers an additional feedback loop (new code running in production) and encourages frequent deployments. Frequent deployments are less likely to break, because they are more likely to catch additional expectation gaps.

“A new practice is emerging that builds on the advantages established by CI/CD. Continuous Verification (CV) often manifests as a stage in a pipeline, but it can also run independently during business hours, verifying assumptions about the behavior of a system. Much like Chaos Engineering, CV platforms can include Availability or Security components and often express these as hypotheses. Like CI/CD, the practice is born out of a need to navigate increasingly complex systems. Organizations do not have the time or other resources to validate that the internal machinations of the system work as intended, so instead they verify that the output of the system is in line with expectations. This is the preference for verification over validation that hallmarks successful management of complex systems.”
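To make the excerpt's idea of verifying output against hypotheses concrete, here is a minimal sketch in Python. It is not taken from the book or from any particular CV platform; the `Hypothesis` class, the `verify` function, the metric names, and the thresholds are all hypothetical and chosen only for illustration. The point is that each check looks at what the system produces, not at how it is built.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    """A named expectation about the observable output of a system."""
    name: str
    check: Callable[[Dict[str, float]], bool]


def verify(observed: Dict[str, float], hypotheses: List[Hypothesis]) -> List[str]:
    """Return the names of the hypotheses that the observed output disproves."""
    return [h.name for h in hypotheses if not h.check(observed)]


# Hypothetical expectations; real metric names and thresholds would come
# from the business, not from this sketch.
hypotheses = [
    Hypothesis("error rate stays below 1%", lambda m: m["error_rate"] < 0.01),
    Hypothesis("p99 latency stays under 500 ms", lambda m: m["p99_latency_ms"] < 500),
]

# Made-up sample of system output for illustration.
observed = {"error_rate": 0.002, "p99_latency_ms": 850.0}

failures = verify(observed, hypotheses)
for name in failures:
    print(f"Hypothesis disproved: {name}")
```

A pipeline stage could run a check like this after each deployment, or a scheduler could run it periodically, and treat a non-empty list of failures as a signal to investigate.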

There are many properties of a complex system that cannot be designed in the traditional sense, but that a business still needs to optimize. Take availability as an example. There are many approaches to improving availability. Examples include Incident Management, Alerting, Service Degradation, Observability Instrumentation, Disaster Recovery, and Postmortem Facilitation. What all of these approaches have in common is that they are aimed at reducing time to detection (TTD) and time to remediation (TTR) of an incident. They are all reactive.

When we sat down to define Chaos Engineering, we set out to build a discipline that was proactive. Chaos Engineering draws from the rich history of empirical experimentation to proactively discover vulnerabilities in complex systems. The culmination of this was the Chaos Automation Platform (ChAP) built by my team at Netflix.

ChAP sets up experiments with a control group and a variable group in the production environment and verifies whether or not the system behaves as expected under adverse conditions. The hypothesis of these experiments is always of the form: Under X conditions, customers still have a good time. ChAP alerts the service owners whenever it can disprove one of those hypotheses.
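The shape of such an experiment can be sketched in a few lines of Python. This is only an illustration of the control group versus variable group comparison, not ChAP's actual implementation: the `GroupStats` class, the `hypothesis_holds` function, the success-rate metric, and the 1% tolerance are all assumptions, and a real platform would likely use a proper statistical test instead of a fixed threshold.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request counts observed for one group during an experiment."""
    requests: int
    successes: int

    @property
    def success_rate(self) -> float:
        # Treat a group with no traffic as a 0.0 success rate.
        return self.successes / self.requests if self.requests else 0.0


def hypothesis_holds(control: GroupStats, variable: GroupStats,
                     max_drop: float = 0.01) -> bool:
    """True when the variable group's success rate is within max_drop of control."""
    return control.success_rate - variable.success_rate <= max_drop


# Made-up numbers: the variable group ran under the adverse condition.
control = GroupStats(requests=10_000, successes=9_990)
variable = GroupStats(requests=10_000, successes=9_700)

if not hypothesis_holds(control, variable):
    print("Hypothesis disproved: alert the service owner")
```

Comparing against a control group that receives the same kind of traffic at the same time helps separate the effect of the injected condition from ordinary fluctuations in production.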

ChAP is one example of a Continuous Verification tool. Verica takes that to another level, with an enterprise-grade tool that integrates with Kubernetes and Kafka out of the box. Automated experimentation platforms can come in all shapes and sizes, all forms and functions. We can optimize for availability, security, performance, resource utilization, cost, and many other business priorities within this type of framework.

Continuous Verification is a game changer for complex software system management. Today it is changing how we operate systems. In the future it will fundamentally change the scale and types of systems that we even consider building. It’s an exciting time to be working in this space.