Borked Cluster
Categories: tech
Tags: kubernetes
I let my homelab Kubernetes version stagnate for a while. Life got busy, all the typical excuses.
I made the mistake of updating last night. containerd’s update overwrote its configuration file,
borking everything. Of course I did not have it under Ansible version control :-) . After wrestling with kubelet
losing track of all the containers, I made the decision to restart the node. Come along with me to find out how to fix
some of the things!
Athens Proxy
I have an Athens proxy set up in proxy-athens-prod, installed via Helm, and I need to figure out what it’s used for. The installation
predated my use of IaC on my cluster. To grab values from a Helm release you can use `helm -n proxy-athens-prod get values r0`,
where r0 is the release. Using `helm -n proxy-athens-prod list` will show the chart like the following:
| NAME | NAMESPACE | REVISION | UPDATED | STATUS | CHART | APP VERSION |
|---|---|---|---|---|---|---|
| r0 | proxy-athens-prod | 11 | 2024-05-14 12:50:43.961647 -0700 PDT | deployed | athens-proxy-0.11.0 | v0.14.0 |
| redis-r0 | proxy-athens-prod | 1 | 2024-05-14 12:36:58.556789 -0700 PDT | deployed | redis-18.17.0 | 7.2.4 |
Unfortunately, this doesn’t provide me with the original chart URL. The Athens repo points to the chart repository at https://github.com/gomods/athens-charts. The most recent version is 0.15.5! Looks like the only major change is the Redis chart, unsurprisingly.
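Before re-adopting these releases into IaC, it seemed worth snapshotting what is actually deployed. A minimal sketch, assuming the output file names are my own choice:

```bash
# Snapshot the user-supplied values of every release in the namespace
# so they can be re-created declaratively later.
NS=proxy-athens-prod
for rel in $(helm -n "$NS" list -q); do
  helm -n "$NS" get values "$rel" > "values-${rel}.yaml"
done
```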
Once I pushed this up I found a major problem: my internal Gitea instance is down. Turns out my disks filled up.
containerd on a different volume
By default containerd uses /var/lib/containerd to store metadata. This effectively filled up my / volume and
stopped everything. I resolved this by issuing the following commands:
```bash
NODE=controlplane-00
OTHER_DRIVE_PATH=/data/some-drive/path
OLD_PATH=/var/lib/containerd

# Get workloads off the node before touching the runtime
kubectl cordon "$NODE"
kubectl drain "$NODE" --delete-emptydir-data --ignore-daemonsets
systemctl stop kubelet
systemctl stop containerd

# Move the data directory to the larger volume; leave a symlink behind
mkdir -p "$OTHER_DRIVE_PATH"
mv "$OLD_PATH" "$OTHER_DRIVE_PATH/containerd"
ln -s "$OTHER_DRIVE_PATH/containerd" "$OLD_PATH"

# Bring the runtime back up before readmitting workloads
systemctl start containerd
systemctl start kubelet
kubectl uncordon "$NODE"
```
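A quick sanity check afterwards (a sketch; adjust paths to taste) confirms containerd is really writing to the new volume:

```bash
readlink /var/lib/containerd                 # should print the new location
df -h "$(readlink -f /var/lib/containerd)"   # should show the larger volume
```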
After about 10 minutes things in the cluster stabilized and we could return to normal. Or so I thought. I eventually needed
to run `deployment-replica-mismatch.sh` to find deployment mismatches and decided to “go nuclear” to reduce complexity
with `deployment-shutdown.sh`.
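Neither script is shown here, so the following is a hypothetical sketch of what they might do: flag deployments whose ready replicas don’t match the desired count, then scale everything down to zero:

```bash
#!/usr/bin/env bash
# Hypothetical deployment-replica-mismatch.sh: list deployments whose
# ready replica count differs from the desired count.
kubectl get deployments -A -o json | jq -r '
  .items[]
  | select((.status.readyReplicas // 0) != .spec.replicas)
  | "\(.metadata.namespace)/\(.metadata.name): \(.status.readyReplicas // 0)/\(.spec.replicas) ready"'

# Hypothetical deployment-shutdown.sh: the "go nuclear" option, scaling
# every deployment in the cluster to zero replicas.
kubectl get deployments -A --no-headers | awk '{print $1, $2}' |
  while read -r ns name; do
    kubectl -n "$ns" scale deployment "$name" --replicas=0
  done
```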
ArgoCD -> Gitea
Using ArgoCD with repositories from my Gitea instance means Gitea is a strict dependency of modifying Kubernetes. Kind
of like a snake eating itself in some ways, I guess. Gitea was refusing to start, with the only error message being an
fsnotify error. This appeared to actually be an error related to restarting the devices above.
Really this was a symptom of the host being swamped. I went through and shut down all pods stuck in CrashLoopBackOff or
ContainerCreating.
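The sweep itself was manual, but something along these lines (status names as they appear in `kubectl get pods` output) would do it:

```bash
# Delete every pod currently stuck in CrashLoopBackOff or ContainerCreating.
kubectl get pods -A --no-headers |
  awk '$4 == "CrashLoopBackOff" || $4 == "ContainerCreating" {print $1, $2}' |
  while read -r ns pod; do
    kubectl -n "$ns" delete pod "$pod"
  done
```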
Re-approaching the problem
This obviously isn’t an “I’ll just jiggle a knob to fix it” type of problem. I reprioritized based on what is most important
to my users: my family. We run home-assistant as our house brains, with a Longhorn volume for data storage.
While booting up, it complained loudly about DNS failing! Fair, as unbound did not listen on the expected target.
Amazingly, this came up without any intervention. Paperless-NGX came up with an error, which likely indicates the faulting node was saturated.
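To confirm where unbound was actually listening, a check along these lines helps (the IP and hostname are placeholders for the host’s external address and a real record):

```bash
# What sockets is unbound actually bound to?
ss -lntup | grep unbound
# Query the host's external IP directly; a timeout means unbound isn't listening there.
dig @192.0.2.10 example.com
```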
Gitea
Gitea was broken because DNS was broken, which in turn traced back to the Bitnami charts.
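One way (a sketch) to verify cluster-internal DNS from the inside is a throwaway pod:

```bash
# Run a disposable pod and resolve a cluster-internal name from it.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```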
TODOs
- Find out why `unbound` on hosts is not properly binding to the external IP address of the host.
- Cluster-internal DNS needs to be checked.
Short-term TODOs
- Move `deployment-replica-mismatch.sh` and `deployment-shutdown.sh`.
- Update this to point here: https://bsky.app/profile/did:plc:rqh2ntpoz3mafvazrzbqfkxx/post/3mf5wrja5hk2h