Software-defined storage is nothing new, and Ceph is one of the most popular implementations. I started with Ceph five years ago because I was looking into unified storage for OpenStack. There are many other solutions, but I like Ceph because it is an all-in-one solution for block, object, and file storage, and it is open source. Inktank, the company behind Ceph, was later acquired by Red Hat, but that made things even better. If you already have a Ceph cluster running, it is easy to use it for Kubernetes. However, if you are designing an entirely new on-premises Kubernetes cluster, you can run Ceph on top of it and still use it for other resources running on Kubernetes. This is where Rook comes into play. It provides deep Kubernetes integration made for cloud-native environments.
I'm excited about Rook, not only because it solves persistent storage problems for Kubernetes, but also because it uses Ceph in the background. I have designed at least five production-grade Ceph clusters, so I'm pretty familiar with Ceph. For weeks I've been meaning to write a post about Rook, and I finally made it.
Tweet from Alen Komljen (@alenkomljen), December 11, 2017: "Thanks @rook_io!"
"If you didn't hear about Rook.io yet, it is Ceph on Kubernetes. In short, a cloud-native storage service." (pic.twitter.com/6oeuGKHPzb)
The Rook Way of Ceph deployment
The good news is that you can run Ceph on Kubernetes and then use that storage for other Kubernetes resources. Rook, in a nutshell, is an operator, which means that Rook manages the Ceph cluster for you. To learn more about operators, see the post I wrote a few weeks ago about the Elasticsearch operator and how it works.
[Figure: Rook architecture diagram]
Of course, because Ceph requires extra drives to store the data, you need a set of dedicated Kubernetes nodes. Rook is currently in an alpha state, but I expect it to be production ready soon.
The easiest way to install Rook is using Helm. If you haven't tried Helm yet, now is the right time. Add the new Helm repo and install the Rook operator in the kube-system namespace:
    ⚡ helm repo add rook-master https://charts.rook.io/master
    ⚡ helm search rook
    NAME               CHART VERSION        APP VERSION   DESCRIPTION
    rook-master/rook   v0.7.0-10.g3bcee98                 File, Block, and Object Storage Services for yo...
    ⚡ helm install --name rook rook-master/rook \
        --namespace kube-system \
        --version v0.7.0-10.g3bcee98 \
        --set rbacEnable=false
This Helm chart installs the Rook operator and an agent on each node. Check that everything is running and ready:
    ⚡ kubectl -n kube-system get pods -l 'app in (rook-operator, rook-agent)'
    NAME                            READY   STATUS    RESTARTS   AGE
    rook-agent-4rhwt                1/1     Running   0          4m
    rook-agent-6s9v8                1/1     Running   0          4m
    rook-agent-8kgr9                1/1     Running   0          4m
    rook-agent-wqg9l                1/1     Running   0          4m
    rook-operator-845b8b8d4-p6cln   1/1     Running   0          4m
NOTE: If you are installing Rook on Kubernetes nodes running CoreOS or RancherOS, you need to configure FlexVolume first!
With the Rook operator in place, the new custom resources are available. However, we still don't have a Ceph cluster running.
To better understand Rook, you first need to understand Ceph. Ceph is an all-in-one solution for block, object, and file storage. Block storage (think of EBS) is probably what will interest you most. Each time you create a Kubernetes Persistent Volume Claim (PVC), Ceph creates a new volume. The main component responsible for block storage is the Ceph OSD, along with the Ceph MON, which maintains cluster membership, configuration, and state. Those two components are enough for distributed block storage. There are other daemons for additional storage types, plus some helpers such as the API.
Object storage (think of S3) is another layer, and the Ceph component responsible for it is Ceph RadosGW. If you want to learn more about Ceph, check the official architecture docs.
Each Ceph OSD daemon handles only one physical drive. An OSD stores the data as small objects, which are grouped into placement groups (PGs). Each placement group belongs to one pool, and a pool's placement groups are distributed across the OSDs on different nodes. Of course, you can have many pools, and each pool has a defined number of replicas. This means that when you create a PVC, its data is spread across the storage nodes and every object is replicated according to the pool's replica count.
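To make the object, placement group, and OSD relationship more concrete, here is a toy sketch in shell. It is not how Ceph really places data (Ceph uses the CRUSH algorithm and its own hash function); the function names, the cksum-based hash, and the round-robin OSD selection are all assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Toy placement model, NOT Ceph's CRUSH algorithm. All names are hypothetical.

# Map an object name to a placement group:
# checksum of the name modulo the number of PGs in the pool.
pg_for_object() {
  local name=$1 pg_num=$2
  local sum
  sum=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
  echo $(( sum % pg_num ))
}

# Pick `replicas` distinct OSD ids for a PG by walking the OSD ids
# round-robin, starting at (pg mod num_osds). Assumes replicas <= num_osds.
osds_for_pg() {
  local pg=$1 num_osds=$2 replicas=$3
  local i out=""
  for (( i = 0; i < replicas; i++ )); do
    out+="$(( (pg + i) % num_osds )) "
  done
  echo "${out% }"
}

# Example: a pool with 64 PGs on 4 OSDs, 3 replicas per object.
pg=$(pg_for_object "myobject" 64)
echo "object 'myobject' -> pg $pg -> osds $(osds_for_pg "$pg" 4 3)"
```

The point of the sketch is only that an object always lands in the same PG, and that a PG maps to as many distinct OSDs as the pool has replicas.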
For an HA Ceph cluster you need at least three nodes. It is advisable to run an odd number of monitors so that a majority quorum can be formed; the default is three. Let's define the new Ceph cluster in the rook namespace:
    ⚡ kubectl create namespace rook
    ⚡ cat <<EOF | kubectl create -n rook -f -
    apiVersion: rook.io/v1alpha1
    kind: Cluster
    metadata:
      name: rook
    spec:
      dataDirHostPath: /var/lib/rook
      storage:
        useAllNodes: true
        useAllDevices: false
        storeConfig:
          storeType: bluestore
          databaseSizeMB: 1024
          journalSizeMB: 1024
    EOF
Please check the docs for all available options and an explanation of the above config. Wait a few minutes and the Ceph cluster should be up and running:
    ⚡ kubectl get pods -n rook
    NAME                              READY   STATUS    RESTARTS   AGE
    rook-api-854ffcf7b-6hnmw          1/1     Running   0          15m
    rook-ceph-mgr0-7957dc8d6c-xndkn   1/1     Running   0          15m
    rook-ceph-mon0-x6782              1/1     Running   0          16m
    rook-ceph-mon1-262tl              1/1     Running   0          16m
    rook-ceph-mon2-v2xv8              1/1     Running   0          16m
    rook-ceph-osd-6jfmh               1/1     Running   0          15m
    rook-ceph-osd-9f7w2               1/1     Running   0          15m
    rook-ceph-osd-ds4h7               1/1     Running   1          15m
    rook-ceph-osd-hkx87               1/1     Running   0          15m
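The three monitor pods in this listing form the quorum. A quick bit of shell arithmetic shows why an odd monitor count is the usual advice: monitors need a strict majority to agree, so a fourth monitor raises the majority size without letting you lose more of them. The helper names below are my own and not part of Rook or Ceph.

```shell
#!/usr/bin/env bash
# Majority arithmetic for a monitor quorum (illustrative helper names).

# Smallest number of monitors that forms a strict majority of n.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many monitors may fail while a majority is still reachable.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "mons=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
# Prints:
# mons=3 quorum=2 tolerated_failures=1
# mons=4 quorum=3 tolerated_failures=1
# mons=5 quorum=3 tolerated_failures=2
```

Going from three to four monitors buys nothing in terms of failures tolerated, while going to five tolerates two.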
As an experienced Ceph user, you will want to run ceph commands to check your cluster state. The easiest way is to deploy a separate rook-toolbox Pod and run commands from there:
    ⚡ cat <<EOF | kubectl create -n rook -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: rook-tools
      namespace: rook
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: rook-tools
        image: rook/toolbox:master
        imagePullPolicy: IfNotPresent
        env:
          - name: ROOK_ADMIN_SECRET
            valueFrom:
              secretKeyRef:
                name: rook-ceph-mon
                key: admin-secret
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /dev
            name: dev
          - mountPath: /sys/bus
            name: sysbus
          - mountPath: /lib/modules
            name: libmodules
          - name: mon-endpoint-volume
            mountPath: /etc/rook
      hostNetwork: false
      volumes:
        - name: dev
          hostPath:
            path: /dev
        - name: sysbus
          hostPath:
            path: /sys/bus
        - name: libmodules
          hostPath:
            path: /lib/modules
        - name: mon-endpoint-volume
          configMap:
            name: rook-ceph-mon-endpoints
            items:
            - key: data
              path: mon-endpoints
    EOF
Now, for example, you can run a Ceph status check command:
    ⚡ kubectl -n rook exec rook-tools -- ceph -s
      cluster:
        id:     053cd70f-9b43-4854-862e-5bed29f1060d
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum rook-ceph-mon1,rook-ceph-mon0,rook-ceph-mon2
        mgr: rook-ceph-mgr0(active)
        osd: 4 osds: 4 up, 4 in

      data:
        pools:   0 pools, 0 pgs
        objects: 0 objects, 0 bytes
        usage:   8199 MB used, 145 GB / 153 GB avail
        pgs:
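If you want to use the health status in a script rather than read it by eye, a small helper like the one below could work. This is only a sketch: the function name is made up, and it assumes you feed it the one-line output of the ceph health command, which starts with HEALTH_OK, HEALTH_WARN, or HEALTH_ERR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when the given `ceph health`
# output starts with HEALTH_OK.
ceph_is_healthy() {
  case "$1" in
    HEALTH_OK*) return 0 ;;
    *)          return 1 ;;
  esac
}
```

With the toolbox Pod from above, you could call it as ceph_is_healthy "$(kubectl -n rook exec rook-tools -- ceph health)" and branch on the exit code.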
Ceph should report HEALTH_OK status, but we have 0 pools available. Before we can consume this cluster, we need to create at least one pool with the desired number of replicas. The number of replicas is usually set to three:
    ⚡ cat <<EOF | kubectl create -n rook -f -
    apiVersion: rook.io/v1alpha1
    kind: Pool
    metadata:
      name: replicapool
    spec:
      replicated:
        size: 3
    EOF
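Keep in mind what a replica size of three means for capacity: every object is stored three times, so the usable space is roughly the raw space divided by the replica count. The sketch below does that division for the 153 GB raw capacity reported by ceph -s earlier. It is a rough estimate under my own simplifying assumptions: it ignores Ceph's own overhead and full-ratio thresholds, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Rough usable capacity of a replicated pool: raw capacity / replica count.
# Ignores Ceph overhead and full-ratio limits (simplifying assumption).
usable_capacity() {
  local raw_gb=$1 replicas=$2
  echo $(( raw_gb / replicas ))
}

echo "usable: $(usable_capacity 153 3) GB"   # prints "usable: 51 GB"
```

So the 8Gi test claim created later in this post consumes about 24 GB of raw space once it is full.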
Finally, it is time to define the StorageClass for the above pool. Then we can create new PVCs:
    ⚡ cat <<EOF | kubectl create -f -
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: rook-block
    provisioner: rook.io/block
    parameters:
      pool: replicapool
    EOF
Let's create a simple PVC to test whether the Ceph cluster is working fine:
    ⚡ cat <<EOF | kubectl create -f -
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: myclaim
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi
      storageClassName: rook-block
    EOF
    ⚡ kubectl get pvc
    NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    myclaim   Bound    pvc-5f162665-1fa5-11e8-9056-525400474652   8Gi        RWO            rook-block     3s
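A bound claim is only half of the test; the volume still has to be mounted by a Pod. Here is a minimal sketch of a Pod manifest that mounts the claim. The Pod name, the busybox image, the /data mount path, and the file name are all my own choices for illustration and are not something Rook requires.

```shell
#!/usr/bin/env bash
# Write a minimal Pod manifest that mounts the PVC created above.
# Pod name, image, and mount path are hypothetical.
cat > pvc-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-test
spec:
  containers:
  - name: shell
    image: busybox
    # Write one file to the volume, then stay alive for inspection.
    command: ["sh", "-c", "echo hello > /data/hello.txt && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: myclaim   # the PVC from the previous step
EOF
```

Apply it with kubectl create -f pvc-test-pod.yaml. If the Pod reaches the Running state, the Ceph block volume was attached and mounted on the node.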
There are many options for configuring a Ceph cluster. Sometimes you have mixed drive types and want different pools for them, for example, fast storage on SSDs and slow storage on HDDs. You may also want to tune the Ceph cluster a little, but all of those are advanced features. You should learn more about Ceph before moving forward with Rook.
A few weeks ago Rook became a CNCF project, which is good news. Keep in mind that Rook is not production ready yet, and some things can change. I can't wait to put it in place someday for large on-premises distributed storage. For any questions or concerns, please leave a comment. Stay tuned for the next one.