I get many questions about Kubernetes and persistence. Of course, persistence is essential for stateful apps. The usual advice is to use a StatefulSet for stateful apps and a Deployment for stateless apps. That doesn't mean you can't run stateful apps using a Deployment with a persistent volume; for example, the official MySQL Helm chart uses a Deployment. So it can be done, but users get confused about it. What is the deal? When should you use a Deployment, and when a StatefulSet?
Persistent Volume Claim
To have persistence in Kubernetes, you need to create a Persistent Volume Claim, or PVC, which is later consumed by a pod. This can be confusing, because there is also a Persistent Volume, or PV. A PV holds information about the physical storage; a PVC is just a request for a PV. If you have a default storage class, or you specify which storage class to use when creating a PVC, PV creation is automatic. Another, less desirable, way is to create a PV manually and attach the PVC to it, skipping storage classes altogether.
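Before creating a PVC, you can check which storage classes are available in your cluster and inspect the one you plan to use. The rbd class below is specific to my cluster; yours will be different:
⚡ kubectl get storageclass
# the default class, if any, is marked with (default)
⚡ kubectl describe storageclass rbd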
You can define a PVC and set the desired size, access modes, storage class name, etc. Let's create a zookeeper-vol PVC:
⚡ cat <<EOF | kubectl create -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: zookeeper-vol
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: rbd
EOF
⚡ kubectl get pvc zookeeper-vol
NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
zookeeper-vol   Bound     pvc-693857a8-3a8b-11e8-a34e-0238efc27e9c   8Gi        RWO            rbd            10s
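Because the claim went through a storage class, the matching PV was provisioned automatically. You can see it bound to the claim; the PV name matches the VOLUME column above (exact columns depend on your kubectl version):
⚡ kubectl get pv | grep zookeeper-vol
# the CLAIM column shows default/zookeeper-vol and STATUS shows Bound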
In my example, I have a storage class rbd which points to a Ceph cluster. When the new PVC gets created, a new 8Gi volume is ready to use. The important thing here is the access mode:
- ReadWriteOnce – Mount a volume as read-write by a single node
- ReadOnlyMany – Mount the volume as read-only by many nodes
- ReadWriteMany – Mount the volume as read-write by many nodes
The access mode defines how a pod consumes the volume. In most cases, you set ReadWriteOnce so that only one node can do read-write. Please note that multiple pods on that single node can still use the same volume. In some cases, for stateless apps, you want read-only volumes, and for that you need ReadOnlyMany. ReadWriteMany is the rare case, because only a few storage providers support it. Think of ReadWriteMany as NFS.
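For comparison, here is what a shared claim could look like. This is a minimal sketch assuming your cluster has an NFS-backed storage class that supports ReadWriteMany; the shared-vol and nfs names are hypothetical:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: shared-vol
spec:
  accessModes:
  - ReadWriteMany        # many nodes can mount this volume read-write
  resources:
    requests:
      storage: 8Gi
  storageClassName: nfs  # hypothetical NFS-backed class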
Define a Deployment with PVC
It is possible to create a PVC with ReadWriteOnce access mode and then create a deployment which runs a stateful application and uses this PVC. It works perfectly fine, but only as long as you don't scale the deployment. If you try to scale it, you will probably get an error that the volume is already in use when a pod starts on another node. And even if both pods end up on the same node, they will still write to the same volume. You don't want this.
I created a Kubernetes-ready Zookeeper Docker image for this blog post. Let's use the zookeeper-vol PVC created above and define a new Zookeeper deployment which mounts this volume:
⚡ cat <<EOF | kubectl create -f -
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: zookeeper
spec:
  selector:
    matchLabels:
      app: zookeeper
  replicas: 1
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
      - env:
        - name: ZOOKEEPER_SERVERS
          value: "1"
        image: "komljen/zookeeper:3.4.10"
        imagePullPolicy: IfNotPresent
        name: zookeeper
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        readinessProbe:
          exec:
            command:
            - /opt/zookeeper/bin/zkOK.sh
          initialDelaySeconds: 10
          timeoutSeconds: 2
          periodSeconds: 5
        livenessProbe:
          exec:
            command:
            - /opt/zookeeper/bin/zkOK.sh
          initialDelaySeconds: 120
          timeoutSeconds: 2
          periodSeconds: 5
        volumeMounts:
        - mountPath: /data
          name: zookeeper-data
      restartPolicy: Always
      volumes:
      - name: zookeeper-data
        persistentVolumeClaim:
          claimName: zookeeper-vol
---
apiVersion: v1
kind: Service
metadata:
  name: zookeeper
spec:
  ports:
  - name: client
    port: 2181
    targetPort: 2181
  selector:
    app: zookeeper
---
apiVersion: v1
kind: Service
metadata:
  name: zookeeper-server
spec:
  clusterIP: None
  ports:
  - name: server
    port: 2888
    targetPort: 2888
  - name: leader-election
    port: 3888
    targetPort: 3888
  selector:
    app: zookeeper
EOF
If you try to scale this deployment, the other replicas will try to mount and use the same volume. That is okay if your volume is read-only. So, how to work around it for read-write volumes?
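You can reproduce the problem yourself. Scale the deployment and watch the new pod; if it lands on another node, it will typically hang in ContainerCreating with a volume attach failure in its events. The exact message depends on your storage provider:
⚡ kubectl scale deployment zookeeper --replicas=2
⚡ kubectl get pods -l app=zookeeper -o wide
⚡ kubectl describe pod -l app=zookeeper
# check the Events section of the new pod, then scale back down
⚡ kubectl scale deployment zookeeper --replicas=1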
Define a Stateful Set with PVC
When you have an app which requires persistence, you should create a stateful set instead of a deployment. There are many benefits: you don't have to create PVCs in advance, and scaling becomes much easier. Of course, how far you can scale depends on the app you are deploying. With a stateful set you can define volumeClaimTemplates, so that a new PVC is created for each replica automatically. You also end up with a single file which defines both your app and its persistent volumes. Now let's try to deploy Zookeeper using a stateful set:
⚡ cat <<EOF | kubectl create -f -
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: zookeeper
spec:
  selector:
    matchLabels:
      app: zookeeper
  replicas: 1
  serviceName: zookeeper-server
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
      - env:
        - name: ZOOKEEPER_SERVERS
          value: "1"
        image: "komljen/zookeeper:3.4.10"
        imagePullPolicy: IfNotPresent
        name: zookeeper
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        readinessProbe:
          exec:
            command:
            - /opt/zookeeper/bin/zkOK.sh
          initialDelaySeconds: 10
          timeoutSeconds: 2
          periodSeconds: 5
        livenessProbe:
          exec:
            command:
            - /opt/zookeeper/bin/zkOK.sh
          initialDelaySeconds: 120
          timeoutSeconds: 2
          periodSeconds: 5
        volumeMounts:
        - mountPath: /data
          name: zookeeper-vol
      restartPolicy: Always
  volumeClaimTemplates:
  - metadata:
      name: zookeeper-vol
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: rbd
---
apiVersion: v1
kind: Service
metadata:
  name: zookeeper
spec:
  ports:
  - name: client
    port: 2181
    targetPort: 2181
  selector:
    app: zookeeper
---
apiVersion: v1
kind: Service
metadata:
  name: zookeeper-server
spec:
  clusterIP: None
  ports:
  - name: server
    port: 2888
    targetPort: 2888
  - name: leader-election
    port: 3888
    targetPort: 3888
  selector:
    app: zookeeper
EOF
The major difference compared to the deployment is in this part:
spec:
  volumeClaimTemplates:
  - metadata:
      name: zookeeper-vol
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: rbd
After you create this stateful set, a new PVC is also created for the pod zookeeper-0:
⚡ kubectl get pvc | grep zookeeper-0
zookeeper-vol-zookeeper-0   Bound     pvc-68891ba1-3a94-11e8-a34e-0238efc27e9c   8Gi        RWO            rbd            2m
For each new replica, the stateful set will create a separate volume, which makes it much easier to manage pods and their PVCs together.
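To see this in action, scale the stateful set; each new replica gets its own PVC, named after the claim template, the stateful set, and the pod ordinal. Keep in mind that a real Zookeeper ensemble would also need ZOOKEEPER_SERVERS adjusted, so treat this purely as a PVC demonstration:
⚡ kubectl scale statefulset zookeeper --replicas=3
⚡ kubectl get pvc | grep zookeeper
# expect zookeeper-vol-zookeeper-0, zookeeper-vol-zookeeper-1 and
# zookeeper-vol-zookeeper-2, each bound to its own volume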
Summary
Stateful sets are somehow left behind, and most users don't even consider them, yet they are much better at managing stateful apps and persistent volumes. If you want to learn more about stateful sets in general, check the blog post I wrote a few months ago - Stateful Applications on Kubernetes. Stay tuned for the next one.