
Borked Cluster

Categories: tech

Tags: kubernetes

I let my homelab Kubernetes version stagnate for a while. Life got busy, all the typical excuses.

I made the mistake of updating last night: containerd updated, overwriting its configuration file and borking everything. Of course I did not have it under Ansible version control :-) . After wrestling with kubelet losing track of all the containers, I made the decision to restart the node. Come along with me to find out how to fix some of these things!
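
For future me, since the config was not under version control: a minimal sketch of recovering a sane containerd setup, assuming a kubeadm-style node on containerd 1.x where SystemdCgroup = true was the only local tweak.

# Regenerate the stock config, then re-apply local tweaks.
# The sed below assumes SystemdCgroup was the only delta; adjust to taste.
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
systemctl restart containerd
systemctl restart kubelet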

Athens Proxy

I have an Athens proxy set up in the proxy-athens-prod namespace, installed via Helm, and I need to find out its uses. The installation predates my use of IaC on my cluster. To grab values from a Helm release you can use helm -n proxy-athens-prod get values r0, where r0 is the release name. Using helm -n proxy-athens-prod list will show the chart like the following:

NAME      NAMESPACE          REVISION  UPDATED                               STATUS    CHART                APP VERSION
r0        proxy-athens-prod  11        2024-05-14 12:50:43.961647 -0700 PDT  deployed  athens-proxy-0.11.0  v0.14.0
redis-r0  proxy-athens-prod  1         2024-05-14 12:36:58.556789 -0700 PDT  deployed  redis-18.17.0        7.2.4

Unfortunately, this doesn’t provide me with the original chart URL. The Athens repo points to the chart repository https://github.com/gomods/athens-charts. The most recent version is 0.15.5! Looks like the only major change is the Redis chart, unsurprisingly.
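
A sketch of the upgrade path, assuming the charts are served via GitHub Pages at gomods.github.io/athens-charts (the repo URL and alias here are my guesses, not something recovered from the release):

# Re-add the chart repository and upgrade in place, preserving values
helm repo add athens https://gomods.github.io/athens-charts
helm repo update
helm -n proxy-athens-prod get values r0 > r0-values.yaml
helm -n proxy-athens-prod upgrade r0 athens/athens-proxy \
  --version 0.15.5 -f r0-values.yaml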

Once I pushed this up I found a major problem: my internal Gitea instance was down. Turns out my disks had filled up.

containerd on a different volume

By default containerd uses /var/lib/containerd to store metadata. This effectively filled up my / volume and stopped everything. I resolved this by issuing the following commands:

NODE=controlplane-00
OTHER_DRIVE_PATH=/data/some-drive/path
OLD_PATH=/var/lib/containerd

# Evict workloads before touching the runtime
kubectl cordon "$NODE"
kubectl drain "$NODE" --delete-emptydir-data --ignore-daemonsets

# Stop kubelet first so it does not fight the runtime shutdown
systemctl stop kubelet
systemctl stop containerd

# Create only the parent directory; if $OTHER_DRIVE_PATH itself already
# existed, mv would nest containerd inside it and break the symlink below
mkdir -p "$(dirname "$OTHER_DRIVE_PATH")"
mv "$OLD_PATH" "$OTHER_DRIVE_PATH"
ln -s "$OTHER_DRIVE_PATH" "$OLD_PATH"

# Bring the node back into service
systemctl start containerd
systemctl start kubelet
kubectl uncordon "$NODE"
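
The symlink is the quick fix. The more durable option is probably pointing containerd's root setting in /etc/containerd/config.toml at the new path, so the next package upgrade isn't surprised by a link where it expects a directory.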

After about 10 minutes things in the cluster stabilized and we could return to normal. Or so I thought. I eventually needed to run deployment-replica-mismatch.sh to find deployment mismatches, and decided to “go nuclear” to reduce complexity with deployment-shutdown.sh.
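
I haven’t inlined the script here, but a minimal sketch of the same mismatch check, assuming jq is installed, might look like:

# List deployments whose ready replica count differs from the spec
kubectl get deployments -A -o json | jq -r '
  .items[]
  | select((.status.readyReplicas // 0) != .spec.replicas)
  | "\(.metadata.namespace)/\(.metadata.name): \(.status.readyReplicas // 0)/\(.spec.replicas) ready"'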

ArgoCD -> Gitea

Using ArgoCD with repositories from my Gitea instance means Gitea is a strict dependency for modifying Kubernetes. Kind of like a snake eating itself, in some ways, I guess. Gitea was refusing to start, with the only error message being an fsnotify error. This appeared to actually be a side effect of the volume shuffling above.

Really this was a symptom of the host being swamped. I went through and shut down all pods in a CrashLoopBackOff or ContainerCreating state.
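
The sweep was roughly the following; a sketch rather than the exact commands I ran, and note that pods owned by a Deployment will just be recreated, which was fine for clearing the backlog:

# Delete pods stuck in CrashLoopBackOff or ContainerCreating
kubectl get pods -A --no-headers \
  | awk '$4 == "CrashLoopBackOff" || $4 == "ContainerCreating" {print $1, $2}' \
  | while read -r ns pod; do
      kubectl -n "$ns" delete pod "$pod"
    done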

Re-approaching the problem

This obviously isn’t an “I’ll just jiggle a knob to fix it” type of problem. I am reprioritizing based on what is most important to my users: my family. We run home-assistant as our house brains, with a Longhorn volume for data storage. While booting up, it complained loudly about DNS failing! Fair, as unbound was not listening on the expected target.
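
A quick way to compare what unbound is actually bound to against what clients expect (the address below is a stand-in for the host’s real external IP):

# Show the addresses unbound is listening on
ss -lntup | grep -i unbound
# Query the address clients are configured against (stand-in address)
dig @192.168.1.1 example.com +short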

Amazingly, home-assistant came up without any intervention. Paperless-NGX, however, came up with an error, which likely indicates the faulting node was saturated.

Gitea

Gitea was broken because DNS was broken, which in turn was broken as a result of the Bitnami charts.

TODOs

  • Find out why unbound on hosts is not properly binding to the external IP address of the host.
  • Cluster-internal DNS needs to be checked (a quick smoke test is sketched below).
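
For the second item, a quick smoke test of cluster-internal DNS, assuming busybox’s nslookup is good enough for a yes/no answer:

# Resolve a well-known service name from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local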

Short term todos