This is a continuation of the PodSecurityPolicy is Dead, Long live...? article, which looks at how to construct the most effective policy for your Kubernetes infrastructure. Haven't read that? Check it out first.
Building on that foundation, this article looks at how versioning your policies streamlines the developer experience, helping teams deliver features and minimise downtime while still meeting compliance requirements.
"Policy as code" is one of the more recent 'as-code' buzzwords to enter the discourse since 'infrastructure-as-code' paved the way for the *-as-code term. Its fundamental principles sound great: everything in version control, auditable, repeatable, and so on. In practice, however, it can often fall apart when it comes to the day 2 operational challenges, which are only exacerbated by adopting 'GitOps'.
We'll look at a common scenario and present a working example of versioned policy that runs through the entire process to address it.
Let’s start with a likely (simplified) scenario:
That's just your 'business as usual' flow for all your devs.
Your policy engine might allow you to 'dry run' a new policy rule before you 'enforce' it, by putting it in a 'warning', 'audit' or 'background' mode where a warning is recorded in the event log when something breaks the new rule.
But that will only happen if the API server re-evaluates the resources, which usually only occurs when a pod is rescheduled. And again, someone needs to be monitoring the event logs and acting on them, which can introduce its own challenges around exposing those logs to your teams.
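For example, with a Kyverno-style engine (shown here purely as an illustration; the rule and label are hypothetical, not from the example organisation), a dry run amounts to flipping a single field:

```yaml
# Minimal sketch of 'dry running' a rule: Audit reports violations in events
# and policy reports without blocking anything; switch to Enforce once you're happy.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label       # hypothetical rule
spec:
  validationFailureAction: Audit # change to Enforce when the rule goes live
  background: true               # also evaluate resources that already exist
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must carry a team label."
        pattern:
          metadata:
            labels:
              mycompany.com/team: "?*"   # hypothetical label, any non-empty value
```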
All of that activity is happening a long way from the developers who are actually going to have to do something about it.
Furthermore, communicating that policy update between the well-intentioned security team and the developers is fraught with the bureaucratic friction typically found in organisations at scale. The security policy itself might even be considered sensitive, since it may reveal potential weaknesses.
Consequently, reproducing that policy configuration in a local development environment may also prove impracticable. All of this is made much, much worse with multiple clusters for development, staging and production, and with multi-tenancy putting multiple teams and applications, each with their own varying needs, in the same cluster.
First and foremost, sharing the policy is imperative. Your organisation absolutely has to accept that the advantages of exposing the policy, and communicating it effectively to its developers, far outweigh any potential security advantage gained through obscurity.
Along with sharing the policy, you need to articulate the benefit of each and every rule. After all, you've hopefully hired some smart people, and smart people will find workarounds when they don't see value in an obstruction.
Explaining the policy should hopefully help you justify it to yourselves too. Rules naturally become based less on emotion and anecdote and instead grounded in informed threat modelling, which is easier to maintain as your threat landscape changes.
The next step is collecting the policy, codifying it, and keeping it in version control. Once it's in version control, you can adopt the same semantic versioning strategy your developers will already be used to from elsewhere.
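To make those versions consumable, each tag can be turned into a published artifact. As a minimal sketch, assuming a GitHub Actions pipeline and that the policy ships as the policy-checker container image introduced later in this article (the workflow itself is hypothetical, not taken from the reference repos):

```yaml
# Hypothetical release workflow: pushing a tag such as v2.0.1 publishes the
# policy as a correspondingly versioned container image for developers to consume.
name: release-policy
on:
  push:
    tags:
      - "v*.*.*"
permissions:
  contents: read
  packages: write
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
      - name: Build and push the versioned policy image
        run: |
          VERSION="${GITHUB_REF_NAME#v}"   # e.g. v2.0.1 -> 2.0.1
          docker build -t "ghcr.io/example-policy-org/policy-checker:${VERSION}" .
          docker push "ghcr.io/example-policy-org/policy-checker:${VERSION}"
```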
Great, so you've got your policy definitions in version control and tagged with semantic versions. The next step is consuming that within your applications, so your developers can test their applications against it: locally to start with, then later in continuous integration.
Hopefully your developers will at least be used to this: they can treat your policy just like any other versioned dependency.
Now that they're testing locally, implementing the same check in CI should be straightforward; this ensures that peer reviews are only ever carried out on code that is known to pass your policy.
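As an illustrative sketch of that CI step, again assuming GitHub Actions and the policy-checker image from the example organisation described later (the workflow is an assumption, not part of the reference repos):

```yaml
# Hypothetical CI job: run the policy-checker container against the checked-out
# manifests and fail the pull request if they don't satisfy the policy.
name: policy-check
on: [pull_request]
jobs:
  policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check the app against company policy
        run: docker run --rm -v "$PWD:/apps" ghcr.io/example-policy-org/policy-checker
```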
Given it's now a dependency, you can use tools like Snyk, Dependabot or Renovate to automate pull requests that keep it updated, and to highlight to your developers when a policy update is not compatible with their app.
Awesome. Now for the really tricky bit...
From a risk perspective, your organisation needs to be comfortable accepting a transition period during which old policy versions are retired; that ultimately comes down to communication between those setting the policy and those consuming (and subjected to) it. Supporting one version forwards and one backwards means your runtime needs to support at least three significant versions at any one time.
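One way this coexistence could be implemented (an assumption for illustration, not the only approach) is to deploy each version of a rule scoped to the workloads that have opted in to that version, for example by matching on a version label with a Kyverno-style policy:

```yaml
# Sketch: the v2 flavour of a hypothetical rule applies only to workloads that
# declare policy version 2.x via a label; a sibling policy carrying the v1 rule
# would match 1.x the same way, so both versions can run side by side.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: example-rule-v2
spec:
  validationFailureAction: Enforce
  rules:
    - name: example-rule-v2
      match:
        any:
          - resources:
              kinds:
                - Pod
              selector:
                matchExpressions:
                  - key: mycompany.com/policy-version
                    operator: In
                    values: ["2.0.0", "2.0.1"]
      validate:
        message: "This workload does not meet version 2.x of the company policy."
        pattern:
          metadata:
            labels:
              mycompany.com/team: "?*"   # placeholder rule body
```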
I've put together a reference model of this in a dedicated GitHub organisation with a bunch of repositories. Renovate was used to make automated pull requests on policy updates, and you can see examples of that in the repositories.
Other tools:
Please, allow me to introduce you to Example Policy Org.
This app is compliant with version 1.0.0 of the company policy, but the pull request to update it to policy 2.0.1 currently can't merge.
This app is compliant with version 2.0.1 of the company policy.
This app is compliant with version 2.0.1 of the company policy, but it's only using 1.0.0 and can be updated with a pull request.
There's only one simple policy here: it requires that every resource has a mycompany.com/department label. As long as it's set, the value doesn't matter.
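Rendered as a Kyverno-style rule purely for illustration (the reference repositories may express it differently, and the match is narrowed to Pods here for brevity), 1.0.0 might look something like this:

```yaml
# Illustrative sketch of policy 1.0.0: the department label must exist,
# but any non-empty value is accepted.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-department-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-department
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "A mycompany.com/department label is required."
        pattern:
          metadata:
            labels:
              mycompany.com/department: "?*"   # any value will do
```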
Following on from 1.0.0, we found that the lack of consistency wasn't helping: some people were setting it to 'hr', others to 'human resources'. So a breaking policy change has been introduced (hence the major version bump) to require the value to come from a known, pre-determined list. The mycompany.com/department label must now be one of tech|accounts|servicedesk|hr.
The policy team forgot a department! So now the mycompany.com/department label must be one of tech|accounts|servicedesk|hr|sales. This was a non-breaking, very minor change, so we consider it a patch update and only increment the last segment of the version number.
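Continuing the illustrative Kyverno-style rendering above (again, an assumption about how the reference policy might be expressed), only the validation pattern changes between 1.0.0 and 2.0.1:

```yaml
# Illustrative sketch of policy 2.0.1: the value must come from the approved
# list; 'sales' is the addition made in this patch release.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-department-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-department
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "mycompany.com/department must be one of tech, accounts, servicedesk, hr or sales."
        pattern:
          metadata:
            labels:
              # Kyverno patterns allow '|' as a logical OR between values.
              mycompany.com/department: "tech | accounts | servicedesk | hr | sales"
```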
This is an example of everything coexisting on a single cluster. For simplicity, and to keep this free to run, I stand up the cluster each time using KinD, but this could just as well be one or more real clusters.
This is a simple tool to help our developers test their apps: from inside the app directory they can simply run docker run --rm -ti -v $(pwd):/apps ghcr.io/example-policy-org/policy-checker and it'll test whether the app passes.
The location of the policy is intentionally hard-coded; making this reusable outside of our example organisation would take some significant thought and is out of scope for this article.
What I haven't done is require the mycompany.com/policy-version label itself. That's probably a job for the policy-checker and the CI process, and it's also up to your cluster administrators what to do with things that don't carry the label: you might, for example, exclude anything in kube-system, and otherwise require mycompany.com/policy-version >= 1.0.0, raising that minimum version as required. In reality it's just another rule, but one kept separate from the rest of the policy codebase.
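As a sketch of what that separate rule might look like (expressed as a Kyverno-style policy for illustration; enforcing a minimum semver would need extra logic, so this only checks that the label is present):

```yaml
# Hypothetical 'separate rule': every Pod outside kube-system must declare
# which policy version it targets via the mycompany.com/policy-version label.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-policy-version-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-policy-version-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      validate:
        message: "Workloads must declare a mycompany.com/policy-version label."
        pattern:
          metadata:
            labels:
              mycompany.com/policy-version: "?*"
```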
You should be able to reuse the principles we've covered in this article to go forth and version your organisation's policy, making the dev experience a well-informed, compliant breeze.
As you can see, this is far from the finished article. If you'd like to share your thoughts, or you think there's a better answer or more to it, tweet us!