The Great Kubernetes Migration

Howdy!

This is somewhat of a continuation from my original post lamenting over my home infrastructure, The Plights of Bare-Metal Docker Networking.

As I touched on previously, I decided to move over to Kubernetes (specifically microk8s) instead of continuing the frustration that was my jank-tastic Docker setup. There’s a lot of research and development that went into getting all of my systems converted over to Kubernetes, and I hit a few hurdles that others may find a bit difficult as well. I’ll touch on which tech choices I made through the setup and why I went the specific path I did.

Micro Kubernetes

There’s a lot of choice nowadays when it comes to Kubernetes. Just a few years ago, you were effectively stuck setting up this mess all on your own and there wasn’t nearly as many slick ways to kick off a Kubernetes instance as there are today.

A few different methods laid in front of me, though I immediately gravitated towards k3s and microk8s. In my case, I had three nodes available to me, and I wanted to configure a highly-available cluster. High-availability was a very self-imposed requirement, since my environment doesn’t really require for that kind of durability. However, there’s a benefit to having a HA control plane, which means I won’t have to worry about everything grinding to a halt if I decide to shut down any particular host.

One of the other challenges I faced during my initial speculation of this setup was opting to go bare-metal. Most of the guidance online will point you towards virtualized solutions, or completely lean on cloud providers. I wasn’t terribly interested in adding another layer by considering virtualized nodes, though it seems to be a very supported avenue that I could have taken. I lamented a bit on Proxmox (PVE) in my previous post, where I seemed to harbor negative opinions on it. I already have a host dedicated for virtualization and knew for certain that I would only ever be running containers on these hosts, so I wanted to keep the profile as slim as possible.

Before I go any further into software, though, I’ll touch on my hardware specs.

Host	CPU	RAM	Boot Disk	Data Disk	OS
prod0	Epyc 7601	256GB ECC	2x 500GB SATA	500GB NVMe	Ubuntu 22.04
prod1	Threadripper 1950X	64GB non-ECC	2x 500GB SATA	500GB NVMe	Ubuntu 22.04
prod2	Epyc 7601	256GB ECC	2x 500GB SATA	500GB NVMe	Ubuntu 22.04

With that said, I opted to go for microk8s, though I kept k3s as a fallback plan in the event something went wrong. For the longest time, I have already opted for Debian hosts, though I decided to silently migrate over to Ubuntu as I was leaning closer to microk8s. It wasn’t really a required upgrade, but it eliminated some of my setup process by doing so and I was running Canonical’s Kubernetes flavour on a Canonical distribution, so the path seemed to have the least friction.

One of the major appeals to a Kubernetes novice was the module system that was included with microk8s. Many of the modules that I was interested in using (more on those later) were a simple microk8s enable <thing> away and I would have it automatically deployed to the cluster for me. Furthermore, high-availability was something assumed as soon as you brought three nodes into the cluster. Now, most of the Kubernetes world relies on Helm charts to get things installed, but I had literally never dug into any of this beforehand. I was hoping for a somewhat guided experience, but I still wanted to get my hands dirty with kubectl and develop my understanding since my intention was to rely on Kubernetes for my infrastructure moving forward. k3s did make some assumptions about what you wanted in a Kubernetes distribution, which meant it included more than microk8s does. I really wanted a bare-minimum install of Kubernetes to build from the ground-up by myself, not something which has a ton of tooling already set up for me. The opt-in nature of microk8s was the driving factor that led me to pick it up.

Modules

Some of the microk8s modules aren’t the most complicated in the world. You can actually see what every single module does to your system right in their repository. It’s incredibly reassuring that most feature toggles are really just redirects to a bash script, as it means it’s something I’m able to dissect and understand at a later point in time. The enable or disable command really just point to some plain ol’ bash. Many of the addons just wrap around helm charts, like the dns or observability addons. This made them an excellent learning tool, and meant I could incorporate the helm charts directly into my infrastructure, instead of relying on the addon, if I desired.

There’s some paths (notably observability) where I decided to short-cut the module and bake in my own solution (aka, install the chart myself), but for most deployments, I opted to just install the module. It’s likely I’ll want to replace some of my dependence on addons down the line, so I can have a truly portable installation of Kubernetes, but I’m really just focused on getting the basics down and having something operational for the meantime.

Furthermore, implementing your own addons appears to be easy to accomplish, and there’s numerous community-driven modules which implement popular Kubernetes tooling. I didn’t opt to do any of this, since it wasn’t strictly how the rest of the Kubernetes world dealt with it. I’ll likely author my own helm charts if I decide to do something like that, but it’s very nice to have the option available. Most of my deployment is handled by an Ansible playbook which manages all three of the nodes I listed above, so I have a reliable means of deploying features to my cluster should I decide to add anything else in.

Existing Requirements (and how they were met)

Now, there are certain behaviours which I’ve come to expect in my existing setup that I’d like to be able to reproduce.

Replicated storage, since every node should have an equal chance of hosting any particular service.
Controlled network ingress, via real or virtual IPs or some other means.
Controlled network egress, via real IPs or specially crafted routes or some other means.

Storage

Originally, I was using GlusterFS with a brick for my /var/lib/docker/volumes directory on each host. I was operating off of a replicated storage medium, which had a hardware backing on every single node. This is a bit challenging, since I could continue to use GlusterFS and pick up a lot of benefits from doing so. Gluster’s setup process is also fairly simple, compared to more robust solutions like Ceph. When it comes to persistent storage, Kubernetes actually gives a lot of options, even for bare-metal solutions like mine.

However, I opted to try something a bit new. OpenEBS has been pushing rapid development on Mayastor, so I’ve decided to use that as a storage backing for the time being. Mayastor is a fairly young project at this time and there’s quite a few positives and negatives to working with it.

+ Allocation is managed within the cluster, using Custom Resource Definitions (MayastorPool).
+ RAM disk allocation is supported out of the box.
+ Mayastor makes use of the NVMe protocol, and promotes usage of NVMe-over-Fabric.
+ Potential for using RDMA over Converged Ethernet down the line, for incredible speeds.
- Missing many CSI features, such as native snapshotting.
- "Highly durable, but not highly available."
- It's still very beta software.

The manifest allocation for Mayastor is incredibly flexible. Pools are limited to a single node, and a “MayastorPool” is really just a single device. However, a storage class can be defined which encourages any number of replicas at any given point, across any set of defined MayastorPools declared on the system. If you decided to do two-way replication, Mayastor would supposedly find the most tolerant method to do it. I won’t go into much more detail here, but allocation is handled intelligently, where replication won’t happen on the same node if you decide to have multiple pools available.

Up to this point, I haven’t made much of a use of RAM disks. If anything, this is an opportunity to optimize some of my local services to store data in a RAM disk then periodically sync to a persistent volume, though I’m not really sure how much I’ll make use of this. Having the option is a plus, but not a huge draw (yet).

There’s numerous performance promises theoretically possible by using NVMe-oF, which would dramatically reduce cross-system latency for updating mirrors of the physical pools. Adding RDMA into that mix would possibly help things further, which is always great for storage performance. What real-world difference it might make, I can’t say. I’m not really heading into this expecting some leap and bound improvement in performance, especially as these are PCIe 3 drives on a 20Gbit network fabric, though I’d like to include some results on any potential I/O overhead.

Touching on the negatives, a storage backend without snapshotting may be detrimental in some setups to data backup. At first, I was expecting to suffer with running sidecar containers to replicate this data on a schedule, but fortunately Velero (a very popular Kubernetes disaster recovery tool) seems to have basic support for filesystem-level backups. It’s not perfect and it’s still very beta (like everything else), and may not be officially sanctioned in my predicament, but it does have differential backups and I have free reign over the backup schedule. It also means that I can back up my cluster state on a namespace-by-namespace basis, though I intend for most things to be as reproducible as possible from their git repositories.

The phrase “highly durable, but not highly available” did lead me into a bit of a conflict, but realistically my needs aren’t sufficient for a true highly available storage backend. In the event of failure of a MayastorPool link, any pods using that link will find their storage backend ripped right out from under them. Certainly not great, but this means hardware failure likely occurred. In the event of power outage or other system-level failure, the pods would be taken out regardless, so it’s not as much of a concern. A disk failure isn’t a great situation to crash a pod, but theoretically it should be reallocated onto a different node with a good pool link until a replacement can be allocated. If there’s any significant issues with data corruption, there’s a reason backups exist.

Finally, all of this is still beta. I’m taking a leap of faith by integrating Mayastor into my stack, but it has a ease-of-use that I simply wasn’t expecting with Kubernetes storage backends, and it mitigates all of the pain of handling a GlusterFS cluster. Whether it will work out in the long term is another story, but I have other options available to me if it doesn’t work out. Longhorn is a perfectly acceptable solution as well (and would probably have similar performance characteristics to my Mayastor setup), but the barrier of entry was much lower for Mayastor in the microk8s system versus Longhorn.

Networking Ingress

In my Docker Swarm setup, as I mentioned previously, ingress has been a bit of a challenge. Long story short, I’ve been using a Python application that I wrote which I’ve nicknamed Bullseye. Bullseye is incredibly simple, and lets labelled containers modify the host-level network configuration. In the real-world, this is absolutely horrible practice and can be somewhat error-prone. Fortunately, I’ve been aware of the limitations of my tool, so I haven’t had major issues with it, but ultimately it’s a janky solution.

MetalLB was the obvious choice in this case, so I’ll be using MetalLB to handle assigning virtual IPs on a dedicated “service” network which my firewall will be able to route to. This isn’t exactly equivalent to what I’ve done originally, since it uses BGP routes to accomplish servicing secondary IP addresses, where I was actually assigning these addresses and doing some iptables magic behind the scenes to make it “appear” as if the container was a standalone system. BGP was designed for this kind of routing, and MetalLB is stable (albeit beta) and well-supported at this point in time, so it’s a safer bet than what I’m currently doing.

Networking Egress

One last challenge with networking has been getting traffic out to the Internet. In most applications, this is pretty straightforward. Use the host’s default network route and leave it at that. However, I do have need for pods to use a VPN. Ideally, I’d be routing all of the pod’s egress through this VPN and ensuring no traffic leaks out over the normal route.

Now, the brilliant folk(s) at k8s-at-home have a practical way to roll a VPN pod in Kubernetes and use that pod as a gateway. However, I actually have an existing setup, sort of. Currently, my VPNs are managed at the router level and I have allocated a /24 private IP slice for each VPN gateway. Previously, I managed this by throwing a container a macvlan adapter and let it use the sliced network to route on. For instance, one network is 192.168.20.0/24, another is 192.168.21.0/24, and so on, for each VPN gateway. I was able to re-use this setup through Multus, which allows me to make use of a host bridge to request an IP on the real network, using my upstream DHCP server. The setup for this was a little more convoluted than that, but I’ll touch on it shortly.

First Steps and Problem Solving

Let’s get to something actionable, so you might have an idea of what’s really going on here.

Ansible

This will only be a high-level summary of my experience with Ansible, as it’s not really the intended focus of this article. I effectively needed to re-design my compute playbook from scratch, since I had entirely different needs than my original Docker setup.

There’s some “central” playbooks that basically all of my systems use, whether they’re off-site or at home in the rack. These aren’t public, but they’re about as complicated as installing SSH keys, setting some sysctl values, adding packages, and creating a couple users.

Like before, I install my certificate authority. This is a pretty simple file copy and update-ca-certificates to get it baked in everything. Next, I configure my network interfaces through a templated netplan.yml config. I completely purge the installer-provisioned config and template in my own. Each of them look something like this:

network:
  bonds:
    bond0:
      dhcp4: true
      # dual gigabit interfaces on each host.
      interfaces:
        - eno1
        - eno2
      macaddress: <snip>
      # 
      parameters:
        mode: balance-rr
      routes:
        - to: default
          via: 192.168.1.1
    hypernet0:
      dhcp4: true
      dhcp4-overrides:
        use-routes: false
      interfaces:
        - enp97s0
        - enp97s0d1
      macaddress: <snip>
      parameters:
        mode: balance-rr
  ethernets:
    eno1: {}
    eno2: {}
    enp97s0: {}
    enp97s0d1: {}
  vlans:
    vlan.5:
      id: 5
      link: bond0
      dhcp4: false
    vlan.20:
      id: 20
      link: bond0
      dhcp4: false
    # .. same thing continues on ..
  bridges:
    vmbr5:
      dhcp4: true
      macaddress: <snip>
      interfaces: [ vlan.5 ]
    vmbr20:
      dhcp4: true
      macaddress: <snip>
      interfaces: [ vlan.20 ]
    # .. same thing continues on ..

MAC addresses aren’t usually required for a netplan config, but since I’m creating a lot of network interfaces and am using DHCP (with static leases), it’s nice to enforce them. Also, since I have many interfaces on this host, and they all have valid gateways, I want to be certain that they are using the primary gateway (192.168.1.1) when I don’t specify a specific gateway for them to use. The hard routes definition under bond0 ensure that it will be the first route, no matter what.

In short, my routing table gets assembled like this:

default via 192.168.1.1 dev bond0 proto static onlink 
default via 192.168.2X.1 dev vmbr20 proto dhcp src 192.168.2X.50 metric 100
...
default via 192.168.5.1 dev vmbr5 proto dhcp src 192.168.5.50 metric 100 
10.1.91.128/26 via 10.1.91.128 dev vxlan.calico onlink 
10.1.149.192/26 via 10.1.149.192 dev vxlan.calico onlink

You might notice that hypernet0 doesn’t have a route. This is completely intended, as it’s a closed-circuit network on a dedicated switch that handles 2x10Gbit routing. In most cases, I will never saturate this interface, but theoretically I can push that kind of bandwidth from my storage devices. Additionally, Kubernetes will be handling all cluster communication and internal routing over these interfaces.

Finally, onto the good part: Kubernetes, well, microk8s.

The playbook will install microk8s, enable some modules (specifically Mayastor, Helm, and metrics-server), and deploy a few services through helm charts. This includes Velero (for backups), ArgoCD (for continuous deployment), Multus (for weird networking hacks), CoreDNS (for DNS, duh), kube-prom-stack and the Grafana/Loki/Tempo charts for visualizing and alerting, and a DHCP relay for the Multus setup that I mentioned before. You can find more information on this DHCP relay here. Effectively, the relay just passes the DHCP requests upstream so the pods can be agnostic to the pods with Multus-allocated interfaces.

Mayastor’s use is managed by the playbook as well, by creating the MayastorPool on each host and creating a StorageClass that wraps those pools. This is a n-replica pool, where the amount of replicas is equal to the number of nodes. In a larger setup, this would not be efficient, but I have a very limited selection of hosts and my cluster will only ever be so large, so this is tolerable for my current and future use-case. This isn’t too complicated, and looks something like this:

apiVersion: "openebs.io/v1alpha1"
kind: MayastorPool
metadata:
  # adding a MSP named revolver-prodX, where X is the node number.
  name: "revolver-prodX"
  namespace: mayastor
spec:
  # there's only one NVMe backing device per node, which always lives at `nvme0n1`.
  node: "prodX"
  disks: ["aio:///dev/nvme0n1"]

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  # naming my StorageClass "revolver"
  name: "revolver"
parameters:
  # using 3 replicas, since I have 3 hosts as of writing.
  repl: '3'
  # using Mayastor's NVMe-oF implementation.
  protocol: 'nvmf'
  ioTimeout: '60'
  local: 'true'
provisioner: io.openebs.csi-mayastor
volumeBindingMode: WaitForFirstConsumer

I’ve somewhat skimmed the details here, but this somewhat covers most of the automated work I’ve written up in Ansible.

Actually Deploying Things

Wow, deployment time! Let’s run through deploying a service now.

For the sake of length, I’ll be deploying a service that:

makes use of Multus through VPN gateway routing
uses a PersistentVolume with Mayastor as a StorageClass
allocates a virtual IP through MetalLB

These are usually split up into different manifests, but this is approximately how they’d look strung together.

# declaring the Namespace that all of this is going to live in.
apiVersion: v1
kind: Namespace
metadata:
  name: example
---
# creating the PersistentVolumeClaim which we'll use for example dynamic configuration data.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
  namespace: example
spec:
  storageClassName: revolver
  resources:
    requests:
      storage: 20Gi
  accessModes:
    - ReadWriteOnce
---
# the Multus NetworkAttachmentDefinition, creating our adapter to occupy the physical network.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: vpn-gateway-example
  namespace: example
spec:
  # using the host's vmbr20 interface, which was created in the Ansible playbook I mentioned above,
  # we can use a macvlan adapter which will inherit the properties of that parent interface.
  # This doesn't allocate secondary IPs on the host interface, contrary to the format.
  # The interface is created in a different (Linux) namespace on the host, and the container binds to it.
  # Then, the cluster-local DHCP relay is used to request from the physical gateway (my router), which ultimately
  # gives this container a real IP address on a real network.
  config: |
      {
        "cniVersion": "0.3.0",
        "name": "vpn-gateway-example",
        "plugins": [
          {
            "type": "macvlan",
            "master": "vmbr20",
            "mode": "bridge",
            "ipam": {
              "type": "dhcp"
            }
          }
        ]
      }      
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
  namespace: example
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: example
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        # using Velero for PV backups.
        backup.velero.io/backup-volumes: example-data
        # specifying the Multus network, created above.
        k8s.v1.cni.cncf.io/networks: vpn-gateway-example
      labels:
        app.kubernetes.io/name: example
    spec:
      restartPolicy: Always
      hostname: example
      initContainers:
        # special init container I've made which reworks the routing table of the container
        # this DOES persist to the main container, because init containers are fantastic.
        - name: update-routes
          image: registry.hydraulic/vpn-route-scripts:latest
          command: [ "update-routes", "192.168.20.1" ]
          securityContext:
            capabilities:
              add: 
                - NET_ADMIN
        # same container, different script; ensures that it doesn't match my external IP.
        # This "external IP" is set by my internal build server, which injects the IP into the image on-build.
        # If it matches, the container will fail and the pod won't start.
        - name: safety-check
          image: registry.hydraulic/vpn-route-scripts:latest
          command: [ "safety-check" ]
      containers:
        # the actual container that we're running in the Deployment.
        - image: registry.hydraulic/example:tag
          imagePullPolicy: Always
          name: example
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              memory: "1Gi"
          volumeMounts:
            # PV-backed example mount
            - name: example-data
              mountPath: /config
            # NFS-backed example mount
            - name: example-nfs
              mountPath: /data
      volumes:
        # Claim which is allocated in above manifest.
        - name: example-data
          persistentVolumeClaim:
            claimName: example-data
        # Declare a NFS mount here. You could also do this through a PVC, but this is easier.
        - name: example-nfs
          nfs:
            server: some.storage.server
            path: "/some/data/path/here"
---
apiVersion: v1
kind: Service
metadata:
  name: example-service-tcp
  namespace: example
  annotations:
    metallb.universe.tf/address-pool: default-addresspool
spec:
  # using MetalLB
  type: LoadBalancer
  # dedicated `192.168.5.0/24` network is for service ingress. Nothing else resides on that network.
  # So, this `example` service will live at 192.168.5.100, or whatever DNS resolves to that address.
  loadBalancerIP: "192.168.5.100"
  ports:
    - name: example
      protocol: TCP
      port: 80
      targetPort: 8080
  selector:
    app.kubernetes.io/name: example
---
apiVersion: v1
kind: Service
metadata:
  name: example
  namespace: example
spec:
  # generic Service, only accepts cluster-local traffic.
  # This effectively makes the container accessible via `example.example.svc.cluster.local` (and other names).
  # Normally I would have a `NetworkPolicy` in place to prevent other pods from communicating with this one, but
  # I've omitted it for this example.
  type: ClusterIP
  ports:
    - name: example-socket-tcp
      protocol: TCP
      port: 80
      targetPort: 8080
  selector:
    app.kubernetes.io/name: example

kubectl apply -f wall-of-text.yaml

A lot of my images are built in-house on an automated basis, either when a repo commit calls it or on a nightly basis. For my own projects and images, they’re typically only ran when the backing repository is updated. However, I keep local replicas of public images in the event I’m throttled, the registry goes down, or there’s Internet connection troubles. Typically, I create a “obtain-version” script in each image that I’ve extended which lets me dynamically tag things as they’re updated, so I can check when new versions are out by simply checking my build server. This isn’t totally flawless, but it serves my need of avoiding reaching out to the Internet each time a host needs a specific copy of the image, and it allows me to have more control over each container image’s lifecycle.

I’ve opted to use the GitOps strategy to manage all of my deployments. In other words, each namespace is allocated a single git repository. Currently, ArgoCD polls each repository every couple minutes (which is fine, since it’s over cluster networking) checking for any manifest updates. I have webhooks from my Gitea server to ArgoCD to trigger a pod update, or I’ll manually promote a build version by updating the version string directly. I typically do this manually for most projects, where I will push a commit to update an application. However, for some of my auto-building images, I will use a :master tag and set imagePullPolicy: Always in the Deployment manifest, so that it will always keep the application updated. Generally, this isn’t the best practice and you want to avoid relying on tags that might change your image out from under you, but I have exclusive control over some of these images so it’s perfectly acceptable in my case.

Conclusion

If you’ve actually read all of this, that’s incredible. I hope this was educational to at least someone, so you can see what I’ve done to get things running. I’ll admit, some of this is pretty high-level in terms of detail, so it might not be enough for you to get a working microk8s setup directly. However, this was meant to glide over some of the tools and technology that other people have built which help me run a complex environment right out of my own house.

I had some hitches along the way. The most notable one, for those that might follow in my path, is the current state of microk8s v1.25. I spent many weeks troubleshooting some Calico issues that the microk8s maintainers seem to be aware of, which are only present in v1.25. This was causing all sorts of issues, like CoreDNS not being reachable for pods that weren’t on the same node. Most notably, Mayastor was totally broken in this state, since cross-node traffic was effectively broken, which broke the etcd cluster that it deploys using etcd-operator.

This has been a highly educational journey, that has actually been in development for over a month, and I’m now happy with a working Kubernetes cluster in my closet. Performance has been significantly better than my previous Docker environment, and day-to-day workflows are much more flexible. This is certainly just the beginning of my adaptation to Kubernetes, and I anticipate to stick with it for quite some time. Hopefully I won’t eat my words.

I would have never made it this far without a plethora of resources available on the Internet. I took a somewhat easier road to getting Kubernetes running in my home (microk8s), but you can make the journey as easy or as difficult as you’d like. From building every component from scratch, to deploying single-binary instances, to an interface that manages it all for you, there’s so many shapes and sizes for Kubernetes nowadays that it’s actually starting to make sense just about everywhere you go, even in homelabs.

Here’s some excellent resources that helped me out along the way:

the Kubernetes documentation - official documentation, duh.
Jeff Geerling’s YouTube series on Kubernetes - great example rolling out & scaling Drupal.
the Kubernetes Slack workspace - geared towards professional use of k8s.
the k8s-at-home Discord - led me to explore other network solutions, like Cilium and Multus.
TechnoTim’s Discord - gave me the inch I needed to use kube-prom-stack.

The Great Kubernetes Migration

Micro Kubernetes #

Modules #

Existing Requirements (and how they were met) #

Storage #

Networking Ingress #

Networking Egress #

First Steps and Problem Solving #

Ansible #

Actually Deploying Things #

Conclusion #