Back in April, I wrote a summary post about the project I had been working on to migrate from the original Rancher Kubernetes cluster I created in 2020 to a new k3s cluster. My intention was to continue that series and detail each part of the project. Unfortunately, I ran into some technical problems, which meant that not only did the cluster get shut down, but I didn’t have time to even look at it. Since this site had been moved to the new k3s cluster, it also went down.

I also started a new business in 2020, which I’d been building part-time while working full-time as an IT leader at a financial services company. In April, shortly after the post I referred to above, I went to a marketing conference where I learned how to market the business. Since then, business has picked up enough that I was able to resign from my salaried job and work in the business full-time. We’re not profitable yet, but we’re on track to be profitable within a year.

Unfortunately, that meant I didn’t have time to finish the migration project, so I had simply shut down the VMs. Then, a few weeks ago, the old Rancher cluster suffered a catastrophic crash and would not come back up. I had no choice but to make time to finish the migration.

Automation to the rescue

Fortunately, the guiding principles I set out when I started the project have paid off. All of the VMs were provisioned with Terraform on the Proxmox cluster, k3s was installed using Ansible, and all of the workloads were deployed using helm and Ansible-driven templated Kubernetes manifests. That meant that when the new cluster ran into problems (more on that in a minute), it was just a matter of deleting the VMs, recreating everything with a few commands, and restoring the Longhorn volumes from backup.

The Problems

It turns out that most of the stability issues with the new cluster came down to two things. First, there is a bug in the NIC driver that causes the network connection on one of the Proxmox hosts to drop under heavy load. Cluster storage is provided by Longhorn, which relies on each volume being replicated across three Kubernetes nodes (VMs). In my two-node Proxmox cluster, three k3s nodes run on each physical server, so when that host’s network dropped, half of the k3s nodes (and their Longhorn replicas) went with it. The more load there was on the cluster, the more unstable it became.

The second problem appears to be using ReadWriteMany volumes on Longhorn. As opposed to ReadWriteOnce, ReadWriteMany allows pods on different nodes to read and write to the same volume. Under the hood, I believe Longhorn implements ReadWriteMany by exporting the volume over NFS so that it can be mounted on the other nodes in the cluster. Perhaps it’s related to the first problem as well, but every time I created a ReadWriteMany volume, the entire cluster became very unstable.
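For reference, this is roughly what such a claim looks like (the name and size are placeholders, and it assumes Longhorn's default storage class):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data              # placeholder name
spec:
  accessModes:
    - ReadWriteMany              # served by Longhorn's NFS-based share-manager
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi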

The Migration

In the original project, since I was new to Kubernetes, I used the Rancher web GUI to create many of the objects. Later, I started using Ansible playbooks to dynamically apply manifests to the cluster to create and maintain new applications. For example, when a new version was released, I only had to update the variable in the playbook with the version number and run the playbook.
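The idea looks roughly like this (the application name, template file, and version are made up for illustration, and I'm showing it with the kubernetes.core.k8s module, though any way of applying a rendered manifest works the same):

- hosts: localhost
  vars:
    app_version: "1.2.3"    # bump this when a new release ships
  tasks:
    - name: Apply the templated Deployment manifest
      kubernetes.core.k8s:
        state: present
        template: templates/myapp-deployment.yaml.j2   # the template references app_version in the image tag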

The applications deployed using the GUI were the toughest, as they required me to recover the manifests from the cluster and convert them into Ansible playbooks:

$ kubectl -n namespace get deployment name -o yaml > name.yaml

Fortunately, I was able to restore an old backup of the VMs and bring them up long enough to dump everything to manifest files with the following script:

#!/usr/bin/env bash
# Dump every object of the listed resource types, in every namespace,
# to CONTEXT/<namespace>/<resource>/<name>.yaml

set -e

CONTEXT="$1"

if [[ -z "${CONTEXT}" ]]; then
  echo "Usage: $0 KUBE-CONTEXT"
  exit 1
fi

# jq -r prints the names without the surrounding quotes
NAMESPACES=$(kubectl --context "${CONTEXT}" get -o json namespaces | jq -r '.items[].metadata.name')
RESOURCES="pvc pv configmap serviceaccount secret ingress service deployment statefulset hpa job cronjob"

for ns in ${NAMESPACES}; do
  for resource in ${RESOURCES}; do
    # every object of this resource type in this namespace
    rsrcs=$(kubectl --context "${CONTEXT}" -n "${ns}" get -o json "${resource}" | jq -r '.items[].metadata.name')
    for r in ${rsrcs}; do
      dir="${CONTEXT}/${ns}/${resource}"
      mkdir -p "${dir}"
      kubectl --context "${CONTEXT}" -n "${ns}" get -o yaml "${resource}" "${r}" > "${dir}/${r}.yaml"
    done
  done
done
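Running it with the name of a kubeconfig context leaves one YAML file per object under a CONTEXT/namespace/resource/ directory tree, which made it easy to work through the applications one at a time and turn each into a playbook.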

The basic process for each application was to:

  • Create a new Longhorn volume
  • Create a temporary volume pointing to the old NFS location
  • Run a Kubernetes Job to rsync the data from the source volume to the destination volume (a sketch of that Job follows this list)
  • Delete the temporary source volume
  • Convert the old manifest or playbook into a new playbook that deploys all of the Kubernetes objects
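The copy Job is nothing fancy; this is roughly what it looks like (the namespace, claim names, and image are placeholders, and any image that includes rsync will do):

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-data
  namespace: myapp                           # placeholder namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: rsync
          image: instrumentisto/rsync-ssh    # any image with rsync works
          command: ["rsync", "-avh", "/source/", "/dest/"]
          volumeMounts:
            - name: source
              mountPath: /source
            - name: dest
              mountPath: /dest
      volumes:
        - name: source
          persistentVolumeClaim:
            claimName: old-nfs-volume        # temporary claim on the old NFS export
        - name: dest
          persistentVolumeClaim:
            claimName: new-longhorn-volume   # the new Longhorn volume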

What’s next?

I know I’ve made this promise before, but I’m still planning to write additional posts going into more detail on each of the components of the project.

In order to keep this promise, I’m adding regular posts to my weekly TO DO list. This site has been focused on self-hosting and data sovereignty, but I also want to expand into other topics that interest me. These include productivity, organization, time management, gaming, politics, my experience starting a new business, and my thoughts on the IT services industry.