Managing State in Kubernetes

State

Many people associate state with databases, and that’s fair enough; databases are most certainly stateful and a prime use case. State as a concept is broader than databases, and so to understand state in Kubernetes, and in particular the difference between StatefulSets and PersistentVolumes, let’s dive into state abstractly.

State is the condition or quality of something at a given moment in time. For example, a couple of weeks before Christmas in 2017, the state of Bitcoin was awesome! A year later, the state of Bitcoin could be called disappointing. Or think of an employee who wakes up every day and goes to work when he or she is in a healthy state; however, if sick, that employee remains home. In that way, the employee is stateful, acting differently when it’s time for work depending on a state variable (sick = true or false).

In programming, we call functions and systems stateless if they always give the same output (response) to the same inputs. By contrast, a function or system is stateful if it can return different outputs given the same input.

Here’s a simple example of a stateless function in JavaScript:

function convertMetersToCentimeters (meters) {
  return meters * 100
}

It always returns the same result given the same input. By contrast, here’s an example of a stateful function:

let total = 0
function addAndReturnTotal (number) {
  total = total + number
  return total
}

As you can see, unlike convertMetersToCentimeters, addAndReturnTotal returns a different result each time, even if you keep calling it with the same number, 5. The first call returns 5, the second 10, and the third 15. By contrast, convertMetersToCentimeters returns 500 every time you call it with 5 as an argument.

Now let’s say you run this code on a server, and for some reason, users love it and keep using it, seeing how big of a number they can collectively achieve (if you think this is entirely hypothetical, see Cow Clicker). Your server is coming under immense load, and your users are unhappy about how long it is taking them to add numbers to the total!

You buy a bigger server, and migrate to it. There was some downtime, but not too long, and you feel good that you’ve satisfied your users. To your astonishment, your adder goes viral, and the load is clearly too much for one server. Plus, replacing the server every time you want to grow is extremely inefficient.

Kubernetes promises rapid horizontal scaling, so you package your code as a container image and deploy it to Kubernetes as many identical Pods in a Deployment, with a Service in front load balancing incoming requests across the Pods (see our earlier post for details on these concepts). You set up Kubernetes on PKS, in a vSphere environment, where you can add physical hosts and VMs as needed to grow your cluster.

This solves your scaling problem, but introduces a bug. The addAndReturnTotal function now runs as many horizontally scaled replicas, with different users hitting different instances. That strategy would have worked just fine for convertMetersToCentimeters, since it’s stateless, but addAndReturnTotal is stateful, and each replica now keeps its own copy of the state.

Your users start complaining. They are sharing their results on forums and notice that many of them are getting conflicting totals back. In fact, sometimes the total even appears to go down, which is odd considering the function has no way to subtract. The bug occurs because each Pod has its own memory, so each Pod can hold a different total, and depending on which Pod a given user is load balanced to, they might get a larger or smaller total than they expect.
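To make the bug concrete, here is a minimal simulation in plain JavaScript. It assumes three replicas and a strict round-robin load balancer; createPod and sendRequest are hypothetical names for illustration, and a real Service does not necessarily distribute requests in this order.

```javascript
// Each simulated Pod keeps its own `total` in its own memory.
function createPod () {
  let total = 0
  return {
    addAndReturnTotal (number) {
      total = total + number
      return total
    }
  }
}

// Three replicas behind a hypothetical round-robin load balancer.
const pods = [createPod(), createPod(), createPod()]
let nextPod = 0
function sendRequest (number) {
  const pod = pods[nextPod]
  nextPod = (nextPod + 1) % pods.length
  return pod.addAndReturnTotal(number)
}

// Six users each add 5. A single shared total would climb to 30.
const responses = []
for (let i = 0; i < 6; i++) {
  responses.push(sendRequest(5))
}
console.log(responses) // [ 5, 5, 5, 10, 10, 10 ]
```

No replica ever sees more than two of the six additions. With less even routing, a user who lands on a quieter Pod would see a total smaller than one they saw a moment earlier.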

To fix the bug, you would not necessarily need a StatefulSet, but you would want a PersistentVolume (we’ll dive into both of these shortly). You could isolate the state (the total) in a text file and make your addAndReturnTotal function stateless. You create a PersistentVolume that holds the text file, mount the volume in all the Pods of the Deployment (technically, in the Deployment YAML manifest you’d reference a PersistentVolumeClaim, and the underlying storage must allow many Pods to mount it read-write at the same time), and then access the file to get and set the current total. Then you update your function to something like:

function addAndReturnTotal (currentTotal, newNumber) {
  const newTotal = currentTotal + newNumber
  return newTotal
}

Before you call your function, you first lock the file so that other requests wait, then read the current total from it, pass the current total to your function along with the new number, write the new total back to the file, and finally unlock it. The lock has to be taken before the read; otherwise two Pods could read the same total, each add their own number, and one write would overwrite the other. That is the race condition the locking prevents.
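The lock, read, compute, write, unlock sequence could be sketched with Node.js built-ins as follows. This is only one possible approach, not necessarily how the adder would be built: addToSharedTotal, the file names, and the lock-file technique (exclusive file creation) are assumptions for illustration. The temporary directory stands in for the mounted volume, and exclusive creation may not be reliable on every network file system.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// The stateless function: its output depends only on its inputs.
function addAndReturnTotal (currentTotal, newNumber) {
  return currentTotal + newNumber
}

// Hypothetical helper: lock, read, compute, write, unlock.
function addToSharedTotal (dir, number) {
  const totalFile = path.join(dir, 'total.txt')
  const lockFile = path.join(dir, 'total.lock')

  // The 'wx' flag creates the lock file and throws EEXIST if it already exists.
  let lockFd
  while (lockFd === undefined) {
    try {
      lockFd = fs.openSync(lockFile, 'wx')
    } catch (err) {
      if (err.code !== 'EEXIST') throw err
      // Another writer holds the lock; retry (a real app would back off).
    }
  }

  try {
    // A missing file means nobody has added anything yet.
    const currentTotal = fs.existsSync(totalFile)
      ? Number(fs.readFileSync(totalFile, 'utf8'))
      : 0
    const newTotal = addAndReturnTotal(currentTotal, number)
    fs.writeFileSync(totalFile, String(newTotal))
    return newTotal
  } finally {
    // Always release the lock, even if reading or writing failed.
    fs.closeSync(lockFd)
    fs.unlinkSync(lockFile)
  }
}

// Stand-in for the mounted volume: a fresh temporary directory.
const sharedDir = fs.mkdtempSync(path.join(os.tmpdir(), 'adder-'))
const first = addToSharedTotal(sharedDir, 5)
const second = addToSharedTotal(sharedDir, 5)
console.log(first, second) // 5 10
```

Because the total lives in the file rather than in process memory, any replica that mounts the same directory would continue from the same number.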

State is the enemy of horizontal scaling, and modern best practices favor horizontal scaling because it is robust and flexible. But by isolating state in a trusted, redundant persistent volume (perhaps from vSAN with CockroachDB), you can horizontally scale the rest of your system. There are other reasons to prefer stateless components where possible, such as more consistent automated testing.

PersistentVolumes

Kubernetes was originally designed for stateless logic and expected you to run anything stateful elsewhere, but PersistentVolumes and StatefulSets have since been added. An ordinary Kubernetes Volume, such as an emptyDir, survives a container restart within its Pod, but it does not survive the death of the Pod. And since Pods themselves are ephemeral in Kubernetes (they are destroyed and recreated in migrations and updates), such a volume is ephemeral too, and not generally suitable for long-term storage like database files and user-generated content.

The PersistentVolume was introduced to fill that need, allowing IT teams to manage databases and other persistent data on Kubernetes. To use PersistentVolumes you must have storage provided to Kubernetes. This could be a vSAN datastore, a cloud bucket, a NAS device, or a simple disk attached to a physical host. In Kubernetes you would need a provisioner configured for any such source, and then a StorageClass to describe that storage option for your cluster.

Once you have a StorageClass, you can define a PersistentVolumeClaim that names that StorageClass. Finally, in the Pod spec of a Deployment, you can mount the PersistentVolumeClaim as if it were a volume. When the claim is created (or, depending on the StorageClass’s volume binding mode, when the first Pod that uses it is scheduled), Kubernetes calls the storage provisioner configured in the StorageClass to create the disk/bucket on the fly. The Pods of that Deployment then each have that new volume mounted to access that disk/bucket.
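As a rough sketch of how the claim and the Deployment reference each other, the two manifests are written here as JavaScript objects and printed as JSON, which kubectl accepts alongside YAML. Every name, the image, the size, and the mount path are placeholders, not values from the adder app.

```javascript
// Hypothetical names throughout: 'adder-total', 'adder', 'fast-shared',
// and the container image are placeholders for illustration.
const sharedClaim = {
  apiVersion: 'v1',
  kind: 'PersistentVolumeClaim',
  metadata: { name: 'adder-total' },
  spec: {
    storageClassName: 'fast-shared', // must name an existing StorageClass
    accessModes: ['ReadWriteMany'], // many Pods mount the same volume
    resources: { requests: { storage: '1Gi' } }
  }
}

const adderDeployment = {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: 'adder' },
  spec: {
    replicas: 3,
    selector: { matchLabels: { app: 'adder' } },
    template: {
      metadata: { labels: { app: 'adder' } },
      spec: {
        containers: [{
          name: 'adder',
          image: 'example.com/adder:1.0',
          // Where the volume appears inside the container.
          volumeMounts: [{ name: 'total', mountPath: '/data' }]
        }],
        // The Pod spec refers to the claim, not to the disk itself.
        volumes: [{
          name: 'total',
          persistentVolumeClaim: { claimName: sharedClaim.metadata.name }
        }]
      }
    }
  }
}

// Wrap both objects in a List so they can be applied together.
const deploymentManifests = {
  apiVersion: 'v1',
  kind: 'List',
  items: [sharedClaim, adderDeployment]
}
console.log(JSON.stringify(deploymentManifests, null, 2))
```

The Deployment never mentions the StorageClass or the provisioner. It only names the claim, which keeps the application manifest independent of where the storage comes from.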

This is called Dynamic Provisioning: you do not precreate the disk/bucket, but rely on the provisioner to make those API calls for you. You can also go one step further and mark a StorageClass as the cluster default by giving it the annotation storageclass.kubernetes.io/is-default-class: "true" (the default is chosen by this annotation, not by the StorageClass’s name). Then you can make a PersistentVolumeClaim that does not specify a StorageClass at all, and you’ll just get a disk/bucket from the provisioner configured in the default StorageClass.
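A default StorageClass and a claim that relies on it might look roughly like this, again written as JavaScript objects that serialize to JSON manifests. Kubernetes identifies the default class by its is-default-class annotation rather than by its name. The class name, provisioner string, and claim name below are hypothetical placeholders.

```javascript
const defaultStorageClass = {
  apiVersion: 'storage.k8s.io/v1',
  kind: 'StorageClass',
  metadata: {
    name: 'standard', // any name works; the annotation marks the default
    annotations: { 'storageclass.kubernetes.io/is-default-class': 'true' }
  },
  // Placeholder: use the provisioner name of your real storage driver.
  provisioner: 'example.com/hypothetical-provisioner'
}

// No storageClassName here, so the cluster default is used.
const claimUsingDefault = {
  apiVersion: 'v1',
  kind: 'PersistentVolumeClaim',
  metadata: { name: 'adder-total' },
  spec: {
    accessModes: ['ReadWriteOnce'],
    resources: { requests: { storage: '1Gi' } }
  }
}

console.log(JSON.stringify(defaultStorageClass, null, 2))
console.log(JSON.stringify(claimUsingDefault, null, 2))
```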

If you didn’t want Dynamic Provisioning at all, you could instead provision the disk/bucket yourself in the storage provider (examples: vSAN, S3, Google Cloud Storage), then define a PersistentVolume in Kubernetes to represent that piece of storage. You could then create a PersistentVolumeClaim and mount that claim in a Pod spec. This way Kubernetes won’t try to provision a new disk/bucket for you dynamically, but rather will just mount the one you’d set up outside of Kubernetes.
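For the static route, a sketch under the assumption that the pre-created storage is an NFS export (the server, path, and object names are hypothetical): the PersistentVolume describes the existing storage, and the claim binds to it by name. Setting storageClassName to an empty string on the claim tells Kubernetes not to provision anything dynamically.

```javascript
// Represents storage that already exists outside Kubernetes.
const staticVolume = {
  apiVersion: 'v1',
  kind: 'PersistentVolume',
  metadata: { name: 'adder-total-pv' },
  spec: {
    capacity: { storage: '1Gi' },
    accessModes: ['ReadWriteMany'],
    persistentVolumeReclaimPolicy: 'Retain', // keep the data if the claim is deleted
    storageClassName: '',
    nfs: { server: 'nfs.example.com', path: '/exports/adder' } // placeholder export
  }
}

// Binds to the volume above instead of asking a provisioner for a new one.
const staticClaim = {
  apiVersion: 'v1',
  kind: 'PersistentVolumeClaim',
  metadata: { name: 'adder-total' },
  spec: {
    storageClassName: '', // empty string: no dynamic provisioning
    volumeName: staticVolume.metadata.name,
    accessModes: ['ReadWriteMany'],
    resources: { requests: { storage: '1Gi' } }
  }
}

console.log(JSON.stringify([staticVolume, staticClaim], null, 2))
```

A Pod spec mounts this claim exactly as it would a dynamically provisioned one, by referring to the claim name.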

StatefulSets

Consider the adder app described in the earlier example. As it grows you decide to build new features like user accounts and payment processing. You realize immediately that a simple text file won’t cut it, and you need a real database. You break the app into three tiers and put the database tier in a StatefulSet.

StatefulSets are very different from PersistentVolumes, though for databases they are often used together. A Deployment mounts a PersistentVolume through a PersistentVolumeClaim that all of its Pods share, whereas a StatefulSet uses volumeClaimTemplates to create a separate PersistentVolumeClaim for each Pod. Deployments are said to be stateless not because they can’t claim persistent disks (they can), but because Kubernetes does not keep track of which Pod is which, and load balances across them irrespective of their identity.

StatefulSets were released because there are some cases where you need to know which Pod is which, even if it gets “moved” (destroyed and recreated). When you use a StatefulSet instead of a Deployment, the Pods are given unique identifiers, and unique DNS entries to keep their name on the network consistent even if destroyed and recreated.
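The naming scheme is predictable enough to compute. Pods in a StatefulSet are named with the StatefulSet name plus an ordinal starting at 0, and each gets a DNS entry under the governing Service. The helper below is a hypothetical illustration that assumes the default cluster domain, cluster.local; the example names are placeholders.

```javascript
// Hypothetical helper: lists the stable names a StatefulSet's Pods receive.
function statefulPodNames (statefulSetName, serviceName, namespace, replicas) {
  const names = []
  for (let ordinal = 0; ordinal < replicas; ordinal++) {
    const podName = `${statefulSetName}-${ordinal}`
    names.push({
      pod: podName,
      // Assumes the default cluster domain, cluster.local.
      dns: `${podName}.${serviceName}.${namespace}.svc.cluster.local`
    })
  }
  return names
}

const zkNames = statefulPodNames('zk', 'zk-hs', 'default', 3)
console.log(zkNames[0].dns) // zk-0.zk-hs.default.svc.cluster.local
console.log(zkNames[2].pod) // zk-2
```

If the Pod zk-1 is destroyed and recreated, its replacement is again called zk-1 and answers at the same DNS name, which is what lets the other members find it.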

When you attach a Service to a StatefulSet, you set clusterIP to None and do not define a type (in the Service spec). This makes it a Headless Service, and Kubernetes does not load balance requests across the Pods. Rather, you address each Pod individually.
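A Headless Service paired with a StatefulSet could be sketched as follows, as JavaScript objects that serialize to JSON manifests. The names, image, port, and sizes are hypothetical placeholders. The StatefulSet points at the Service through serviceName, and volumeClaimTemplates gives each Pod its own claim.

```javascript
const headlessService = {
  apiVersion: 'v1',
  kind: 'Service',
  metadata: { name: 'adder-db' },
  spec: {
    clusterIP: 'None', // headless: no virtual IP, no load balancing
    selector: { app: 'adder-db' },
    ports: [{ name: 'sql', port: 5432 }]
  }
}

const databaseStatefulSet = {
  apiVersion: 'apps/v1',
  kind: 'StatefulSet',
  metadata: { name: 'adder-db' },
  spec: {
    serviceName: headlessService.metadata.name, // governs the Pods' DNS names
    replicas: 3,
    selector: { matchLabels: { app: 'adder-db' } },
    template: {
      metadata: { labels: { app: 'adder-db' } },
      spec: {
        containers: [{
          name: 'db',
          image: 'example.com/hypothetical-db:1.0', // placeholder image
          ports: [{ containerPort: 5432 }],
          volumeMounts: [{ name: 'data', mountPath: '/var/lib/data' }]
        }]
      }
    },
    // One PersistentVolumeClaim is created per Pod from this template.
    volumeClaimTemplates: [{
      metadata: { name: 'data' },
      spec: {
        accessModes: ['ReadWriteOnce'],
        resources: { requests: { storage: '10Gi' } }
      }
    }]
  }
}

console.log(JSON.stringify([headlessService, databaseStatefulSet], null, 2))
```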

Unless you have a good reason to do this, don’t; instead use a Deployment with a typed Service (usually ClusterIP for internal access, or LoadBalancer for external access). But there are good reasons to use StatefulSets with Headless Services, for example when running ZooKeeper, Kafka, or databases.

ZooKeeper relies on a consensus algorithm for distributed systems, and each Pod needs a consistent ID to vote with. Before StatefulSets that would have been possible, but cumbersome, in Kubernetes. You would have had to define a separate Deployment and Service for each Pod, rather than containing them all in one StatefulSet with one Headless Service.

For a complete description with example configurations, see the ZooKeeper example in the Kubernetes documentation: Running ZooKeeper, A Distributed System Coordinator. For a more common example, see the Kubernetes blog on Deploying PostgreSQL Clusters using StatefulSets. This would have been the approach we would have taken to solve our Adder App database problem from the earlier example.

Managing State

Managing state in Kubernetes is achieved through StatefulSets and PersistentVolumes. They each deal with different aspects of managing state. PersistentVolumes are for persistent storage, while StatefulSets are for addressing Pods directly by a unique identifier instead of sending your requests to a Service that load balances across them. StatefulSets and PersistentVolumes are often used together, for example when deploying databases. However, Deployments and ReplicaSets, which are stateless, can be given access to an outside source of persistent state through PersistentVolumes.