[{"content":" \u0026ldquo;PASSED\u0026rdquo; doesn\u0026rsquo;t mean what you think it means.\nThe 2am alert storm My phone exploded with critical alerts overnight. ntfy was angry. Alertmanager was angry. Ceph was angry. Sonarr was rolling back to a state that didn\u0026rsquo;t exist. Prometheus had even disconnected itself from Alertmanager — the canary alert that fires when alerting itself is broken.\nThe root cause was buried under several cascading consequences, but the actual finding was simple: the NVMe holding /var on stanton-02 had stopped responding to interrupts. The kernel logged Disabling IRQ #173 and gave up. Ceph\u0026rsquo;s osd.1 went down, mon.b lost its RocksDB, and the whole cluster started swimming.\nOver the next several hours of recovery, I confirmed that this wasn\u0026rsquo;t a one-off. It was the latest event in a documented escalation that had been building for weeks. The Samsung 990 PRO 1TB on stanton-02 had been throwing controller-fatal-status events since mid-April — six events over 19 days, each interval shorter than the last, until the kernel finally gave up on the IRQ and walked away.\nBut that\u0026rsquo;s not the interesting bit. The interesting bit is that the drive\u0026rsquo;s two siblings — same batch, same firmware, same workload, in the same cluster — are also dying. Just at different rates.\nThis is part 1 of 2. Part 1 is the autopsy. Part 2 will come when the new drives arrive.\nThe cluster Three Minisforum MS-01 mini-PCs, each running Talos Linux as a Kubernetes control-plane node. Cluster name: stanton (cf. Star Citizen). Each box has two NVMe drives:\nnvme0 — Samsung PM9A3 1.92TB. Enterprise-class with PLP. Rook-Ceph OSD storage. Rock solid. nvme1 — Samsung 990 PRO 1TB. Consumer. Holds /var: etcd WAL, Ceph mon RocksDB, container logs, kubelet state. You can probably already see where this is going.\nI bought the three 990 PROs as a matched batch from Amazon AU\u0026rsquo;s Global Store on 2024-06-18. 
Same firmware revision (4B2QJXD7 — the one Samsung released after the 990 PRO firmware-killing-itself bug, so we\u0026rsquo;re not even talking about THAT problem). They went into production almost immediately and have been running 24/7 for ~22 months.\nThe workload on /var:\netcd WAL — Every Kubernetes API write. Pod scheduling, controller reconciliation, kubelet leases. Constant fsync. Ceph mon RocksDB — Cluster state churn. Constant tiny writes. Container runtime overlay — Image extraction, log writes, layer state. Fsync-heavy. Small-block-write heavy. The exact opposite of what consumer SSDs are tuned for.\nThe autopsy After getting the cluster back to HEALTH_OK, I pulled SMART data off all three nvme1 drives. Same command on each:\n1 2 kubectl debug node/stanton-XX --image=alpine --profile=sysadmin -- \\ sh -c \u0026#34;apk add -q smartmontools \u0026amp;\u0026amp; smartctl -a /dev/nvme1\u0026#34; Here\u0026rsquo;s the comparison:\nMetric stanton-01 stanton-02 (failed) stanton-03 Serial S73VNU0X303066H S73VNU0X303413H S73VNU0X303400H Firmware 4B2QJXD7 4B2QJXD7 4B2QJXD7 Power-On Hours 15,856 15,864 15,867 Percentage Used 42% 47% 50% Data Units Written 96.3 TB 112 TB 133 TB Power Cycles 83 35 38 Unsafe Shutdowns 37 (45% of cycles) 15 (43% of cycles) 13 (34% of cycles) Critical Warning 0x00 0x00 0x00 Media \u0026amp; Data Integrity Errors 0 0 0 Available Spare 100% 100% 100% SMART Self-Test PASSED PASSED PASSED Temperature 54°C 53°C 53°C Three things should jump out.\nFirst, all three drives \u0026ldquo;PASSED\u0026rdquo; the self-test. The drive that just died with a kernel-level IRQ-disable failure says it\u0026rsquo;s healthy. So does the one with 50% wear. So does the one I haven\u0026rsquo;t even seen flap yet.\nSecond, stanton-03 has more wear (50%) than the drive that just died (47%). It\u0026rsquo;s next in line.\nThird, the wear math doesn\u0026rsquo;t add up. The 990 PRO 1TB has a 600 TBW endurance rating. 
stanton-03 has written 133 TB — 22% of its rated endurance — but reports 50% used. The drives are wearing roughly twice as fast as host writes alone would suggest.\nThat last one is the actually interesting story.\nWhy is the wear accelerating? Percentage Used in NVMe SMART data isn\u0026rsquo;t a measurement of how many host writes you\u0026rsquo;ve done. It\u0026rsquo;s the drive\u0026rsquo;s own estimate of how much of its internal NAND endurance reserve has been consumed.\nFor consumer drives, the gap between \u0026ldquo;host writes\u0026rdquo; and \u0026ldquo;NAND wear\u0026rdquo; gets large when you have:\nSmall random writes — etcd does fsync after every write. The drive can\u0026rsquo;t batch these, so it ends up writing-in then re-writing pages constantly to maintain durability semantics. No power-loss protection — every unclean shutdown forces the drive to discard in-flight write buffers and rebuild from journal, which means re-writing pages the drive thought it could batch. Wear amplification. Mixed read/write pages — when read traffic and write traffic share NAND blocks, the drive shuffles data around to keep cells in spec. All extra writes the host never asked for. Each of those happens constantly under an etcd + mon workload.\nThe unsafe-shutdown counter is the nail in the coffin. Across the three drives:\nstanton-01: 37 unsafe shutdowns out of 83 cycles (45%) stanton-02: 15 unsafe shutdowns out of 35 cycles (43%) stanton-03: 13 unsafe shutdowns out of 38 cycles (34%) I don\u0026rsquo;t have a UPS. The cluster has weathered multiple powercuts since I built it, plus the occasional kernel-level reboot under stress. Every one of those is a little bit of write-amp punishment to a drive that has no capacitors to flush its DRAM cache to NAND.\nPower Loss Protection — what consumer NVMe doesn\u0026rsquo;t have Enterprise NVMe drives have a row of tantalum capacitors on the PCB. 
When the host yanks power, those caps hold the drive alive just long enough to flush its DRAM write buffer to flash. Result: no data loss, no in-flight pages stuck in limbo, no journal-replay amp on the next boot.\nConsumer NVMe drives do not have those capacitors. Cost-cut. The 990 PRO is a consumer drive. So is the SN850X. So is anything you\u0026rsquo;d buy at a big-box store with \u0026ldquo;Pro\u0026rdquo; in the name.\nWhen a consumer drive loses power mid-write:\nIn-flight writes that were in DRAM are gone. The host\u0026rsquo;s write-cache thinks they hit NAND, but they didn\u0026rsquo;t. On next boot, the drive replays its journal to figure out which pages are valid and which are torn. That replay re-writes a lot of pages \u0026ldquo;to be safe.\u0026rdquo; All of which counts against your NAND endurance reserve. This is why enterprise SSD specs say things like \u0026ldquo;0.4 DWPD\u0026rdquo; or \u0026ldquo;1 DWPD\u0026rdquo; or \u0026ldquo;3 DWPD\u0026rdquo; — Drive Writes Per Day, sustained for the warranty period (usually 5 years). The 990 PRO\u0026rsquo;s spec is 600 TBW over 5 years, which works out to about 0.33 DWPD if you do the math. That assumes a clean workload with no powercut amplification.\nWhat I have is consumer drives, with no PLP, doing fsync-heavy etcd workloads, on hosts with no UPS, in a region that has the occasional powercut. Of course the wear is accelerating.\nWhat SMART didn\u0026rsquo;t tell me The most maddening thing about this whole episode is that the drive\u0026rsquo;s \u0026ldquo;PASSED\u0026rdquo; self-test was technically correct, right up until it wasn\u0026rsquo;t.\nNVMe SMART tracks things like media errors, temperature exceedances, and the available-spare counter. None of those tripped. The drive on stanton-02 is still reporting 100% Available Spare and 0 Media Errors as of writing. 
It also happens to be unable to respond to interrupts anymore.\nThe actual signal of impending failure was buried in the kernel log — six controller-fatal-status events over 19 days, never more than five days apart:\nnvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x11 CSTS=0x3 has the Controller Fatal Status bit (CFS, bit 1) set alongside RDY (bit 0): the drive\u0026rsquo;s own controller is asserting fatal status on itself. That\u0026rsquo;s the drive saying \u0026ldquo;something is wrong with me, please reset me.\u0026rdquo; The kernel resets it, the drive comes back up, and SMART still says PASSED because by the spec, none of the threshold-based metrics have been crossed.\nThe escalation timeline:\n# Date (UTC) Failure mode 1 2026-04-15 Soft CFS reset, auto-recovered 2 2026-04-20 Soft CFS reset, auto-recovered 3 2026-04-21 Soft CFS reset, auto-recovered 4 2026-04-25 Soft CFS reset, auto-recovered 5 2026-04-30 Soft CFS reset, auto-recovered 6 2026-05-04 IRQ disabled, no auto-recovery The pattern is \u0026ldquo;drive needs repeated kicks until eventually the kernel gives up on it.\u0026rdquo; None of which shows up in smartctl --health.\nSide effects when the cluster_network is on the same node Here\u0026rsquo;s the bit that turned an annoying-but-recoverable single-drive failure into a 4-hour cluster-wide incident: stanton-02 also runs one of three Ceph monitors AND one of three OSDs. The 990 PRO holds /var (mon RocksDB), and the PM9A3 in nvme0 holds the OSD bluestore data. When the 990 PRO died, mon.b went silent, but the OSD itself was still up.\nThen I rebooted the node to get the drive back. The reboot killed the Thunderbolt ring that Ceph uses for cluster_network traffic — a documented MS-01 quirk where the second TB port doesn\u0026rsquo;t always re-enumerate after a warm boot. So when the node came back, OSDs were up,in per Ceph, but osd.1 and osd.2 couldn\u0026rsquo;t actually talk to each other over the cluster network. 
PGs got stuck peering for an hour while traffic spilled to public_network and the slow-heartbeat alarms climbed past 500 seconds.\nI wrote up the Thunderbolt fix separately — kernel arg thunderbolt.host_reset=0 baked into a custom factory.talos.dev schematic — but it\u0026rsquo;s worth mentioning here because it\u0026rsquo;s the failure-mode amplifier. A single dying disk wouldn\u0026rsquo;t have caused a cluster-wide alert storm if my Ceph cluster network wasn\u0026rsquo;t running over Thunderbolt cables that don\u0026rsquo;t always come back up after a reboot. Two unrelated weaknesses combined into one bad night.\nWhat I\u0026rsquo;m doing about it After confirming the failure was real and ongoing, I went back to Amazon AU. The drive had 38 months left on a 5-year warranty, the failure mode is documented in dmesg with timestamps and serial numbers, and the SMART screenshots showed the wear/unsafe-shutdown picture clearly. Amazon\u0026rsquo;s Global Store rep was sympathetic.\nTo my surprise, they refunded the full cost of all three drives — not just the failing one. Recognition that a same-batch matched set is going to fail in similar ways was a nicer outcome than I expected.\nNow I\u0026rsquo;m shopping for replacements. The path:\nEnterprise NVMe with hardware PLP — non-negotiable. The whole point is to remove the consumer-NAND-on-server-workload mismatch. M.2 22110 form factor — fits the MS-01\u0026rsquo;s slot 2 and 3. The PM9A3 already in nvme0 has been rock solid; putting its sibling family in nvme1 keeps the cluster homogeneous. At least 1 DWPD endurance class — overkill for my measured 180 GB/day write rate (~0.18 DWPD on a 1TB drive), but every doubling of headroom is insurance against future workload growth. The shortlist I\u0026rsquo;ve narrowed it to is Samsung PM9A3 M.2 22110 960GB (NEW from a Chinese eBay seller at ~AU$554 each) or Micron 7450 PRO 480GB (new retail, but the NZ pricing is eye-watering). 
The math + budget pushed me toward the PM9A3 — it matches the drive that\u0026rsquo;s been working flawlessly on the same cluster for 22 months.\nThat\u0026rsquo;s where Part 2 comes in. New drives, installation, performance comparison, the burn-in protocol, and the real test: whether enterprise PLP actually fixes the failure mode I\u0026rsquo;ve documented here, or whether the MS-01\u0026rsquo;s chassis is going to throw new and unexpected thermal headaches at me with 8.2W enterprise drives in slots designed for 5W consumer parts.\nLessons so far \u0026ldquo;PASSED\u0026rdquo; SMART status is necessary but not sufficient. Watch the kernel log for CSTS=0x3 and similar; SMART\u0026rsquo;s threshold-based metrics will lag behind the actual drive health by months. Consumer NVMe under etcd workload is a category error. Even on a homelab, if the drive holds /var for a Kubernetes control-plane, it\u0026rsquo;s doing enterprise work. Buy enterprise. The Percentage Used metric tells you the truth. When it\u0026rsquo;s growing roughly 2× faster than Data Units Written ÷ TBW would predict, your drive is wearing out faster than spec, and you need to plan for replacement before the controller events start. PLP is the structural fix. A UPS helps with powercuts but doesn\u0026rsquo;t fix the fsync-amp problem on consumer NAND. Same-batch drives die together. If one drive in a matched set fails, pull SMART on all of them. They\u0026rsquo;ll be on the same trajectory. In my case, the most-worn drive isn\u0026rsquo;t the one that failed first — it\u0026rsquo;s the one I haven\u0026rsquo;t seen flap yet. Architectural single-points-of-pain compound. A drive failure on its own is recoverable. A drive failure plus a fragile cluster_network on the same node is a bad night. Audit your dependencies before you have to. Part 2 incoming when the new drives arrive. Until then I\u0026rsquo;m running on borrowed time on stanton-03 (the 50%-wear sibling). 
Coffee in hand, alert thresholds tightened, Renovate auto-merge disabled on Ceph until the swap is done.\n","date":"2026-05-05T00:00:00+12:00","permalink":"https://blog.nerdz.cloud/2026/three-990-pros-all-dying-part-1/","title":"The Slow Death of Three Samsung 990 PROs"},{"content":" \u0026ldquo;The best AI assistant isn\u0026rsquo;t the smartest one. It\u0026rsquo;s the one that remembers you told it not to do that thing last Tuesday.\u0026rdquo;\nIntro It\u0026rsquo;s been a while since Part 3 where I got Ollama running as a DaemonSet with shared storage across my three MS-01 nodes. That setup worked, but it had some fundamental limitations that started bugging me:\nNo GPU heterogeneity — All three MS-01 nodes have Intel UHD 770 iGPUs. When I added pyro-01 (with a GTX 1080 Ti) to the cluster, Ollama had no way to federate inference across different GPU types. No load balancing — Requests hit whichever pod the service routed to. No awareness of which node was busy or idle. No memory — Every conversation started from zero. The AI had no idea who I was, what we\u0026rsquo;d talked about, or what I\u0026rsquo;d asked it to remember. That last one is the big one. I don\u0026rsquo;t just want a chatbot — I want an AI stack that builds context over time, across every interface I use.\nSo I ripped it all out and started over.\nWhy LocalAI? LocalAI is an OpenAI-compatible API server that runs locally, similar to Ollama. But it has some features that make it significantly more interesting for a multi-node homelab:\nHeterogeneous GPU Support My cluster has two types of GPU hardware:\nNode GPU VRAM LocalAI Image ms-01 (x3) Intel UHD 770 (iGPU) Shared 16GB RAM gpu-intel (SYCL/oneAPI) pyro-01 NVIDIA GTX 1080 Ti 11GB GDDR5X gpu-nvidia-cuda-12 LocalAI has dedicated container images for each GPU vendor. Different images, same API. 
Each worker loads models suited to its hardware.\nOpenAI-Compatible API Just like Ollama, LocalAI exposes /v1/chat/completions, /v1/embeddings, and all the standard OpenAI endpoints. Any tool that speaks OpenAI can talk to LocalAI without modification.\nMemory Reclaimer LocalAI can automatically evict idle models from memory when resources get tight. On constrained hardware (Intel iGPUs sharing system RAM), this is essential. Ollama would just OOM-kill.\nP2P Federation (The Feature I Wanted But Couldn\u0026rsquo;t Use) LocalAI advertises P2P federation using edgevpn and libp2p — a CPU-only load balancer that discovers GPU workers via a DHT mesh. Workers join the network with a shared token, and the LB routes requests automatically.\nThis was the killer feature that sold me on LocalAI. It didn\u0026rsquo;t work. More on that below.\nThe P2P Federation Trap I spent significant time building out a P2P federated setup:\nA CPU-only load balancer running local-ai federated Intel workers running local-ai run --p2p --federated An NVIDIA worker running the same A shared edgevpn token (base64-encoded YAML with room, rendezvous, mDNS, and OTP keys) via ExternalSecret The LB started fine. EdgeVPN initialized, DHT bootstrapped, and it listened on port 8080. Then it spammed No available nodes yet for 20+ minutes and never found a single worker.\nWhy P2P Fails in Kubernetes The edgevpn DHT bootstrap mechanism relies on two discovery methods:\nmDNS — Uses UDP multicast to 224.0.0.251:5353. Works on a LAN. Does not work across Kubernetes nodes. Each pod has its own network namespace; multicast doesn\u0026rsquo;t cross node boundaries in Cilium (or most CNIs).\nDHT via public IPFS bootstrap nodes — Falls back to bootstrap.libp2p.io:4001 when no custom bootstrap peers are set. In my cluster, this DNS name doesn\u0026rsquo;t resolve from inside pods. 
Even if it did, the DHT would need to successfully NAT-traverse between pods, which Cilium\u0026rsquo;s eBPF datapath doesn\u0026rsquo;t support for libp2p\u0026rsquo;s hole-punching.\nThere is a LOCALAI_P2P_BOOTSTRAP_PEERS_MADDRS env var (added in PR #4200) that lets you specify custom bootstrap peers. But the LB\u0026rsquo;s peer ID and port are randomly generated on each startup, so you\u0026rsquo;d need a stable identity (persisted key file) and a fixed listen port (LOCALAI_P2P_LISTEN_MADDRS), plus a headless service for stable DNS. At that point you\u0026rsquo;re fighting the architecture harder than using it.\nThe Simpler Solution I dropped P2P entirely and used direct Kubernetes services instead:\nEach worker group (Intel, NVIDIA) gets its own ClusterIP Service Kubernetes handles load balancing between the two Intel replicas naturally Consumers point at the right service directly No DHT, no mDNS, no edgevpn, no P2P token. Just k8s doing what k8s does.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # Intel workers — 2 replicas, k8s load-balances service: app: controller: local-ai-intel ports: http: port: 8080 # NVIDIA worker — single replica service: app: controller: local-ai-nvidia ports: http: port: 8080 Open WebUI gets both via OPENAI_API_BASE_URLS (semicolons):\n1 OPENAI_API_BASE_URLS: \u0026#34;http://local-ai-intel:8080/v1;http://local-ai-nvidia:8080/v1\u0026#34; Other consumers point at whichever backend has their models — mem0 talks to Intel (embeddings), OpenClaw talks to NVIDIA (coder models).\nOther Gotchas P2P wasn\u0026rsquo;t the only thing that fought me. Here are the other issues I hit, because if you\u0026rsquo;re deploying LocalAI in Kubernetes, you\u0026rsquo;ll probably hit them too.\nBackend Alias Resolution is Broken at Model-Load Time When you set LOCALAI_EXTERNAL_BACKENDS: \u0026quot;llama-cpp\u0026quot;, LocalAI downloads a meta-backend from the gallery. On an Intel system, llama-cpp resolves to intel-sycl-f16-llama-cpp. 
On NVIDIA, it resolves to cuda12-llama-cpp.\nThe downloaded llama-cpp directory contains only a metadata.json with \u0026quot;meta_backend_for\u0026quot;: \u0026quot;intel-sycl-f16-llama-cpp\u0026quot; — no run.sh, no binaries. The actual backend is in the intel-sycl-f16-llama-cpp directory.\nIf your model config says backend: llama-cpp, LocalAI can\u0026rsquo;t follow the metadata alias at model-load time. It tries to use the llama-cpp directory directly, finds no run.sh, and fails with \u0026ldquo;all backends returned error.\u0026rdquo;\nFix: Always use platform-specific backend names in model configs:\nPlatform Use This Not This Intel iGPU intel-sycl-f16-llama-cpp llama-cpp Intel whisper intel-sycl-f16-whisper whisper NVIDIA CUDA cuda12-llama-cpp llama-cpp LOCALAI_CONFIG_DIR Is Not for Model Configs This one cost me hours. The LOCALAI_CONFIG_DIR flag (--localai-config-dir) sounds like where you put model YAML files. It is not. It\u0026rsquo;s only for api_keys.json and external_backends.json.\nModel YAML config files must live in the models directory (LOCALAI_MODELS_PATH, which defaults to /models/). If you mount them to a separate /configuration/ directory, LocalAI will never see them.\nI use ConfigMap subPath mounts to inject model configs alongside the GGUF files on the PVC:\n1 2 3 4 5 6 7 8 9 10 11 12 13 persistence: models: existingClaim: local-ai-models-intel globalMounts: - path: /models config: type: configMap name: local-ai-config-intel globalMounts: - path: /models/mistral-7b-instruct.yaml subPath: mistral-7b-instruct.yaml - path: /models/nomic-embed-text.yaml subPath: nomic-embed-text.yaml The download_files Directive Is Unreliable for Large Models LocalAI model configs support a download_files directive that downloads models from HuggingFace on startup:\n1 2 3 download_files: - filename: mistral-7b-instruct-v0.3.Q4_K_M.gguf uri: huggingface://MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/... 
For small files (whisper at 142MB, piper TTS at 61MB, nomic embeddings at 81MB), this works fine. For large files (Mistral 7B at 4.1GB, Qwen 14B at 8.7GB), it consistently stalls mid-transfer with \u0026ldquo;Connection reset by peer\u0026rdquo; and leaves .partial files or corrupt incomplete files without the .partial extension.\nI ended up kubectl exec-ing into the pods and using wget with retry:\n1 2 3 kubectl exec -n cortex deploy/local-ai-intel -- \\ wget -c -t 0 -O /models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \\ \u0026#34;https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf\u0026#34; Once the files are on the PVC, they persist across pod restarts. You only need to do this once.\nNon-AIO Images Ship Without Backends The standard container images (like v3.12.1-gpu-nvidia-cuda-12) do not include pre-compiled backends. The /backends/ directory ships empty. Backends are downloaded from the OCI backend gallery at startup via LOCALAI_EXTERNAL_BACKENDS.\nThis means:\nFirst startup is slow (backends download from docker.io/localai/localai-backends) If the download fails (network, permissions, rate limiting), the backend directory is left empty and models fail to load with \u0026ldquo;backend not found\u0026rdquo; The backends: type: emptyDir mount in Kubernetes is correct — you need a writable directory since the image\u0026rsquo;s own /backends/ is empty If you want pre-baked backends, use the AIO (all-in-one) images. 
But those come with pre-configured models too, which may not be what you want.\nThe Architecture (What Actually Works) Here\u0026rsquo;s the final architecture, after all the P2P was stripped out:\nKey differences from the original plan:\nNo P2P load balancer — consumers talk directly to worker services Intel workers (x2): General-purpose models — Mistral 7B (chat), nomic-embed-text (embeddings), whisper (STT), piper (TTS) NVIDIA worker (x1): Coder models — Qwen 2.5 Coder 7B and 14B Component What It Does Why It\u0026rsquo;s Here LocalAI (Intel) Chat, embeddings, whisper, TTS General-purpose inference on 2x MS-01 iGPUs LocalAI (NVIDIA) Code generation Runs larger coder models on the 1080 Ti Qdrant Vector database Stores mem0 memories and Open WebUI\u0026rsquo;s document RAG PostgreSQL Relational database mem0 access controls + history, Open WebUI user data mem0 Memory extraction and retrieval The connective tissue — shared memory across all interfaces OpenClaw Discord agent Chat via Discord with persistent memory Open WebUI Browser chat UI Web-based chat with RAG and shared memory SearXNG Privacy-respecting search Web search for agent queries Claude Code CLI agent (local) Terminal-based AI with the same shared memory via MCP How Memory Flows Across Interfaces This is the part that excites me most. Say I\u0026rsquo;m chatting with my AI agent on Discord and I mention that I\u0026rsquo;m working on a Cilium BGP issue. mem0 extracts that fact, embeds it, and stores it in Qdrant — scoped to my user_id.\nLater, I open Claude Code in my terminal to work on the same problem. The mem0 MCP server searches for relevant memories, finds the Discord context, and injects it. Claude Code already knows what I\u0026rsquo;ve been working on without me repeating myself.\nMy wife opens Open WebUI to ask a cooking question? Her user_id is different — she gets her own memory space. No cross-contamination.\nOne Qdrant instance, multiple collections, all user-scoped. 
The same memory layer serves every interface.\nWhat About the Existing Stack? If you\u0026rsquo;ve been following the series, you\u0026rsquo;ll notice some things changed:\nBefore After Why Ollama (DaemonSet) LocalAI (k8s services) GPU heterogeneity, memory reclaimer, better model configs Open WebUI built-in memory mem0 (universal) Cross-interface memory sharing No vector DB Qdrant Required by both mem0 and Open WebUI RAG No agent OpenClaw (Discord) I wanted to interact via Discord Open WebUI and SearXNG stay — they were already deployed and working. They just get rewired to talk to LocalAI instead of Ollama, and Open WebUI gets the mem0 pipeline filter bolted on.\nPostgreSQL was already running via CloudNative-PG (it backs about 20 other apps in my cluster), so mem0 and Open WebUI just get new databases on the existing cluster.\nCurrent Status As of writing, here\u0026rsquo;s what\u0026rsquo;s deployed and working:\nComponent Status Models/Notes LocalAI Intel (x2) Running mistral-7b-instruct, nomic-embed-text, whisper-1, tts-1 LocalAI NVIDIA (x1) Running qwen2.5-coder:7b, qwen2.5-coder:14b Qdrant Running Vector storage ready mem0 Running API on port 8765, using openmemory-mcp image Open WebUI Running Connected to both LocalAI backends OpenClaw Running Discord bot with coder model access SearXNG Running Web search available What\u0026rsquo;s Coming in Part 5 Next post will cover the mem0 integration — wiring up the pipeline filters in Open WebUI, the OpenClaw mem0 plugin for Discord, and the self-hosted mem0 MCP server for Claude Code. That\u0026rsquo;s where the \u0026ldquo;AI that remembers you\u0026rdquo; promise actually comes together.\nThe full architecture document is in my home-ops repo. 
The manifests are under kubernetes/apps/cortex/.\n","date":"2026-03-12T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2026/deploying-open-llms-04/","title":"Deploying Open Source LLMs in a Homelab - Part 4"},{"content":" \u0026ldquo;The definition of insanity is doing the same thing over and over and expecting different results. The definition of homelabbing is trying every cloud provider until one doesn\u0026rsquo;t hate you.\u0026rdquo;\nThe Goal I needed an Android device for testing Bootible — a one-liner provisioning tool for gaming handhelds. The problem: I don\u0026rsquo;t have a spare Android device lying around, and buying one just for development felt wasteful.\nEnter Redroid — Android containers that run natively on ARM hardware. No emulation overhead, just containerised Android with ADB access. Connect via scrcpy and you\u0026rsquo;ve got a proper Android environment for testing.\nSimple enough, right?\nAttempt 1: TrueNAS SCALE My TrueNAS box has plenty of horsepower — 56 threads, 128GB RAM. Running Redroid there would keep everything local with zero latency. Perfect.\n1 2 ssh truenas sudo apt install linux-modules-extra-$(uname -r) 1 2 Reading package lists... Done E: dpkg was interrupted, you must manually run \u0026#39;sudo dpkg --configure -a\u0026#39; Okay, let\u0026rsquo;s try that:\n1 sudo dpkg --configure -a 1 dpkg: error: unable to access dpkg status area: Read-only file system TrueNAS SCALE is not your typical Linux box. It\u0026rsquo;s an appliance with a read-only root filesystem. No installing packages, no loading kernel modules, no Redroid.\nResult: Dead on arrival.\nAttempt 2: Oracle Cloud A1 (Free Tier) Oracle Cloud\u0026rsquo;s Always Free tier includes up to 4 ARM OCPUs and 24GB RAM on their Ampere A1 instances. Free ARM compute in Sydney? 
Sign me up.\nCreated Oracle Cloud account Navigated to Compute → Create Instance Selected VM.Standard.A1.Flex Clicked Create Out of capacity for shape VM.Standard.A1.Flex in availability domain AD-1. Tried AD-2. Same error. AD-3. Same.\nThis is Oracle Cloud\u0026rsquo;s dirty secret — the free tier instances are perpetually \u0026ldquo;out of capacity\u0026rdquo; in popular regions. Sydney? Forget it. You might get lucky at 3am on a Tuesday, but I wasn\u0026rsquo;t willing to write a script to spam the API.\nResult: Phantom free tier.\nAttempt 3: Oracle Cloud A2 (Paid) Fine. I\u0026rsquo;ll pay. Oracle A2 instances run on the newer AmpereOne processors. Still ARM, but with actual availability.\n# SSH into new A2 instance ssh ubuntu@\u0026lt;oracle-a2-ip\u0026gt; # Install Docker sudo apt install -y docker.io # Load binder module sudo modprobe binder_linux devices=binder,hwbinder,vndbinder # Run Redroid docker run -d --name redroid-11 --privileged \\ -p 5555:5555 \\ redroid/redroid:11.0.0-latest The container started. Docker logs showed Android booting. Then:\ninit: cannot execv(\u0026#39;/system/bin/boringssl_self_test32\u0026#39;): Exec format error And the entire VM froze. SSH session dead. Console showed kernel panic.\nAfter some research: Oracle A2 instances don\u0026rsquo;t support 32-bit ARM binaries. The AmpereOne processors are 64-bit only. Standard Redroid images include 32-bit compatibility libraries that cause immediate kernel panics.\nThe fix? Use Redroid\u0026rsquo;s _64only images:\ndocker run -d --name redroid-13 --privileged \\ -p 5555:5555 \\ redroid/redroid:13.0.0_64only-latest But wait — there\u0026rsquo;s no Android 11 _64only image. The oldest is Android 12. 
And after the kernel panic, I was done fighting with Oracle.\nResult: Works in theory, terrifying in practice.\nAttempt 4: Hetzner ARM (CAX11) Hetzner\u0026rsquo;s ARM servers are cheap (~€4/month), available in multiple regions, and most importantly: they support 32-bit ARM binaries.\nCreated a CAX11 in Helsinki:\n1 2 3 4 5 6 7 8 9 10 11 12 13 # SSH in ssh root@\u0026lt;hetzner-ip\u0026gt; # Install Docker apt update \u0026amp;\u0026amp; apt install -y docker.io # Load binder module modprobe binder_linux devices=binder,hwbinder,vndbinder # Run Redroid docker run -d --name redroid-11 --privileged \\ -p 5555:5555 \\ redroid/redroid:11.0.0-latest It worked. Android 11 booted, ADB connected, scrcpy displayed the Android home screen.\nThen I tried to actually use it:\n1 tailscale ping \u0026lt;hetzner-tailscale-ip\u0026gt; 1 pong from hetzner-android (100.x.x.x) via DERP(fra) in 280ms 280 milliseconds. From New Zealand to Helsinki. Every tap, every swipe, every interaction — a quarter-second delay. For development testing, it was borderline unusable.\nResult: Works, but feels like using Android through molasses.\nAttempt 5: AWS Graviton (Sydney) At this point I was ready to throw money at the problem. AWS has Graviton instances in Sydney. Their t4g.small is free tier eligible (750 hours/month until December 2026). 
Let\u0026rsquo;s try it.\n# Create t4g.small in ap-southeast-2 # SSH in ssh -i ~/.ssh/android.pem ubuntu@\u0026lt;aws-ip\u0026gt; # Install Docker and kernel modules sudo apt install -y docker.io linux-modules-extra-$(uname -r) # Load binder sudo modprobe binder_linux devices=binder,hwbinder,vndbinder # Mount binderfs sudo mkdir -p /dev/binderfs sudo mount -t binder binder /dev/binderfs # Run Redroid (64-bit only for Graviton) docker run -d --name redroid-13 --privileged \\ -p 5555:5555 \\ redroid/redroid:13.0.0_64only-latest Checked boot status:\ndocker exec redroid-13 getprop sys.boot_completed 1 Connected from my PC:\ntailscale ping 100.66.154.79 pong from aws-android (100.66.154.79) via DERP(syd) in 28ms pong from aws-android (100.66.154.79) via DERP(syd) in 29ms pong from aws-android (100.66.154.79) via DERP(syd) in 28ms 28 milliseconds. Ten times faster than Hetzner. scrcpy was responsive, taps registered instantly, the whole experience felt local.\nResult: Finally.\nThe Scorecard Provider Region Cost Latency (NZ) Result TrueNAS Local Free 0ms Read-only filesystem Oracle A1 Sydney Free N/A \u0026ldquo;Out of capacity\u0026rdquo; forever Oracle A2 Sydney ~$15/mo N/A Kernel panics (no 32-bit) Hetzner CAX11 Helsinki ~€4/mo 280ms Works but unusable AWS t4g.small Sydney Free tier 28ms Works perfectly Lessons Learned 1. TrueNAS is an Appliance Don\u0026rsquo;t try to use TrueNAS SCALE as a general-purpose Linux box. It\u0026rsquo;s designed to run apps through their official app system, not arbitrary Docker containers with kernel module requirements.\n2. Oracle Cloud Free Tier is a Lie The A1 instances are theoretically free. In practice, they\u0026rsquo;re never available in useful regions. The paid A2 tier works but has its own problems (see below).\n3. Not All ARM is Equal Oracle A2 (AmpereOne) and AWS Graviton processors are 64-bit only. 
Standard Redroid images include 32-bit ARM libraries that cause kernel panics. Always use _64only variants:\n1 2 3 redroid/redroid:12.0.0_64only-latest redroid/redroid:13.0.0_64only-latest redroid/redroid:14.0.0_64only-latest Hetzner CAX (Ampere eMAG) supports both 32-bit and 64-bit, so standard images work there.\n4. Geography Matters For interactive applications like scrcpy, latency is everything. 280ms makes an Android device feel broken. 28ms feels native. Pick a region close to you.\n5. Free Tier Math AWS t4g.small gives you 750 hours/month free — that\u0026rsquo;s 24/7 operation with room to spare. The free tier runs until December 2026. After that, it\u0026rsquo;s about $12/month in Sydney. Still cheaper than a dedicated Android device.\n6. Tailscale is the Answer Every provider in this journey used Tailscale for access. No public ports exposed, no firewall rules to manage, no VPN certificates to rotate. Just tailscale up and you\u0026rsquo;re connected.\nThe Final Setup 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Tailscale Mesh │ ┌─────────────────┼─────────────────┐ │ │ │ ▼ ▼ ▼ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ Desktop │ │ Phone │ │ Laptop │ │ (WSL2) │ │ │ │ │ └─────┬─────┘ └───────────┘ └───────────┘ │ │ ADB + scrcpy │ 28ms RTT ▼ ┌───────────────────────────────────────────┐ │ AWS ap-southeast-2 │ │ ┌─────────────────────────────────────┐ │ │ │ t4g.small │ │ │ │ ┌──────────────────────────────┐ │ │ │ │ │ Redroid Container │ │ │ │ │ │ Android 13 (64-bit) │ │ │ │ │ │ Port 5555 │ │ │ │ │ └──────────────────────────────┘ │ │ │ └─────────────────────────────────────┘ │ └───────────────────────────────────────────┘ Quick Reference 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # Check latency tailscale ping \u0026lt;aws-tailscale-ip\u0026gt; # Connect ADB adb connect \u0026lt;aws-tailscale-ip\u0026gt;:5555 # Display with scrcpy (optimised for remote) scrcpy -s \u0026lt;aws-tailscale-ip\u0026gt;:5555 \\ --max-size 1024 \\ --video-bit-rate 
4M \\ --stay-awake # Check Android boot status docker exec redroid-13 getprop sys.boot_completed # Shell into Android adb -s \u0026lt;aws-tailscale-ip\u0026gt;:5555 shell The full setup guide is in my home-ops repo.\n","date":"2026-01-11T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2026/cloud-provider-roulette-redroid/","title":"Cloud Provider Roulette: Finding a Home for Redroid"},{"content":" \u0026ldquo;Why pay for game server hosting when you have a 56-thread NAS sitting idle?\u0026rdquo; — Me, justifying another homelab project\nThe Problem: Game Server Sprawl I\u0026rsquo;ve been running various game servers over the years — Minecraft for the kids, Valheim with friends, the occasional ARK survival session. Each one was its own snowflake: different install methods, different backup strategies, different ways of breaking at 2am when someone actually wanted to play.\nWhat I wanted was something like Proxmox for VMs, but for game servers: a web UI where I could spin up a Minecraft server in 30 seconds, manage backups, and not have to SSH into anything unless something was on fire.\nEnter Pterodactyl.\nWhat is Pterodactyl? Pterodactyl is an open-source game server management panel. It\u0026rsquo;s what companies like Apex Hosting and Nodecraft use under the hood (or something similar). The architecture is split into two components:\nComponent Purpose Panel Web UI for managing servers, users, allocations Wings Daemon that actually runs the game server containers The Panel is a Laravel app — database, Redis, the usual web stack. Wings is a Go binary that talks to Docker and manages the actual game server containers.\nIn my setup:\nPanel runs on Kubernetes (GitOps, because everything in my homelab is GitOps) Wings runs on TrueNAS (where I have the storage and spare compute for game servers) Why TrueNAS for Wings? My Kubernetes cluster is three nodes with NVMe storage, optimised for services that need to be highly available. 
Game servers\u0026hellip; don\u0026rsquo;t really fit that mould. They\u0026rsquo;re stateful, they want lots of RAM and CPU, and if a Minecraft server goes down for 30 seconds during a node reboot, the kids will survive.\nTrueNAS, on the other hand, has:\n56 CPU threads (Xeon goodness) 128GB RAM Plenty of spinning rust for world saves Docker support via the app system Running Wings on TrueNAS means game servers get dedicated resources without competing with Grafana and Home Assistant for pod scheduling.\nThe Journey Part 1: Panel Deployment The Panel deployment was straightforward thanks to the bjw-s/app-template chart. The config lives in an ExternalSecret with all the Laravel bits:\n1 2 3 4 5 6 7 8 9 10 # The important environment variables APP_KEY: \u0026#34;{{ .PTERODACTYL_APP_KEY }}\u0026#34; # Laravel encryption key APP_URL: https://pterodactyl.nerdz.cloud DB_HOST: mariadb.database.svc.cluster.local REDIS_HOST: dragonfly.database.svc.cluster.local # S3 backups to MinIO APP_BACKUP_DRIVER: s3 AWS_ENDPOINT: http://citadel.internal:9000 AWS_BACKUPS_BUCKET: gameserver-backups The gotchas I hit:\nCADDY_APP_URL: \u0026quot;:80\u0026quot; — The container uses Caddy internally, and it needs this set TRUSTED_PROXIES — Must be CIDR notation (10.0.0.0/8,172.16.0.0/12,192.168.0.0/16), not * Database init scripts — Only run on first MariaDB deployment, so I had to manually create the database Part 2: DNS and Networking This is where it got interesting. 
Game servers need to be reachable from the internet, which means:\nA DNS record pointing to my external IP Port forwards through the UniFi gateway SSL certificates for the Wings API I created play.nerdz.cloud pointing to my external IP (not proxied through Cloudflare — game traffic needs direct access).\nPort forwards:\nPort Purpose 8443 Wings API (Panel ↔ Wings communication) 2022 SFTP (file uploads) 25565-25600 Game servers (36 allocations) Part 3: Wings Won\u0026rsquo;t Start — SSL Certificates First attempt at starting Wings:\nFATAL: failed to configure HTTPS server error=open /etc/letsencrypt/live/play.nerdz.cloud/fullchain.pem: no such file or directory Wings expects SSL certificates at a specific path. My options were:\nLet Wings auto-generate certs via Let\u0026rsquo;s Encrypt (requires port 80 forwarded) Provide my existing wildcard certificate I went with option 2 — my Kubernetes cluster already has a wildcard cert for *.nerdz.cloud via cert-manager. A quick export and copy later:\n# Export from Kubernetes kubectl get secret envoy-gateway-nerdz-cloud-tls -n network \\ -o jsonpath=\u0026#39;{.data.tls\\.crt}\u0026#39; | base64 -d \u0026gt; fullchain.pem kubectl get secret envoy-gateway-nerdz-cloud-tls -n network \\ -o jsonpath=\u0026#39;{.data.tls\\.key}\u0026#39; | base64 -d \u0026gt; privkey.pem # Copy to TrueNAS scp fullchain.pem privkey.pem truenas:/mnt/storage0/game-servers/wings/certs/ Then mount them in docker-compose:\nvolumes: - \u0026#34;./certs:/etc/letsencrypt/live/play.nerdz.cloud\u0026#34; Part 4: DNS Resolution from Kubernetes The Panel needs to talk to Wings at play.nerdz.cloud:8443. When I tried to create a node, the Panel couldn\u0026rsquo;t resolve the hostname. 
After much head-scratching, I traced it to CoreDNS → node\u0026rsquo;s resolv.conf → systemd-resolved with a stale cache.\nThe fix was adding public DNS servers to my Talos nodes:\n# kubernetes/bootstrap/talos/patches/global/local-dns.yaml machine: network: nameservers: - 10.90.254.1 # UDM Pro - 1.1.1.1 # Cloudflare - 8.8.8.8 # Google Applied via talosctl patch mc to each node — no reboot required since it\u0026rsquo;s a network config change.\nPart 5: The /tmp Mount Gotcha With Wings running and the node connected, I created my first Minecraft server. The Panel showed \u0026ldquo;Installing\u0026rdquo;\u0026hellip; and then nothing. Checking the Wings logs:\nERROR: failed to run install process for server error=Error response from daemon: invalid mount config for type \u0026#34;bind\u0026#34;: bind source path does not exist: /tmp/pterodactyl/407c6b7d-cc34-4d31-a4e3-fa0e51265aa7 This one took a while to figure out. Wings creates install scripts in /tmp/pterodactyl/, then spawns a container to run them. The problem? My docker-compose had:\nvolumes: - \u0026#34;./tmp:/tmp/pterodactyl/\u0026#34; # WRONG This makes the path exist inside the Wings container, but when Wings spawns the install container, Docker resolves the bind source /tmp/pterodactyl on the host filesystem, where nothing exists. The paths need to match:\nvolumes: - \u0026#34;/tmp/pterodactyl:/tmp/pterodactyl\u0026#34; # RIGHT - same path on host and container Create the directory and restart Wings:\nsudo mkdir -p /tmp/pterodactyl sudo docker compose down \u0026amp;\u0026amp; sudo docker compose up -d Part 6: EULA and First Boot With the mount fixed, the Minecraft Forge server installed successfully. I started it up and\u0026hellip; it immediately exited with code 0. The logs showed:\nYou need to agree to the EULA in order to run the server. Go to eula.txt for more info. Classic Minecraft. 
In Pterodactyl\u0026rsquo;s client view (not admin), go to Files, open eula.txt, change eula=false to eula=true, save, and start again.\n[Server thread/INFO] [minecraft/DedicatedServer]: Done (8.248s)! For help, type \u0026#34;help\u0026#34; Server marked as running... Part 7: Testing External Access The moment of truth. Connected from within my LAN first — worked. But that could just be hairpin NAT through the UDM.\nSwitched my phone to mobile data, opened Minecraft, added server play.nerdz.cloud:25565\u0026hellip; and I was in. External access confirmed.\n[User Authenticator #1/INFO]: UUID of player NZVengeance is 525f1ee0-b0cc-4e07-88d8-4d77ffe25e85 [Server thread/INFO]: NZVengeance joined the game Importing More Game Eggs With the infrastructure working, I wanted more than just Minecraft. Pterodactyl uses \u0026ldquo;eggs\u0026rdquo; — JSON templates that define how to install and run different game servers.\nThe community maintains hundreds of eggs at pelican-eggs:\nRepository Contents games-steamcmd 150+ Steam games minecraft All Minecraft variants games-standalone Non-Steam games Importing is straightforward: download the JSON, go to Admin → Nests → Import Egg. 
I added:\nSatisfactory Valheim Palworld ARK Survival Ascended Core Keeper Enshrouded Conan Exiles Icarus CurseForge (for modpack servers) Factorio Each game has different port requirements, so I\u0026rsquo;ll need to add more allocations and port forwards as I spin up servers.\nThe Final Architecture 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Internet │ ▼ ┌─────────────────┐ │ UniFi Gateway │ │ Port Forward │ └────────┬────────┘ │ ┌──────────────────────────┼──────────────────────────┐ │ │ │ ▼ ▼ ▼ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ Pterodactyl │ │ Wings │ │ Game Servers │ │ Panel │◄───────►│ (TrueNAS) │◄───────►│ (Docker) │ │ (Kubernetes) │ │ Port 8443 │ │ Ports 25565+ │ └───────┬───────┘ └───────────────┘ └───────────────┘ │ ┌───────┴───────┐ │ │ ▼ ▼ ┌────────┐ ┌──────────┐ │MariaDB │ │Dragonfly │ │ (DB) │ │ (Cache) │ └────────┘ └──────────┘ Lessons Learned Wings needs SSL — Even for internal communication, Wings expects HTTPS. Either provide certs or disable SSL (not recommended).\nVolume mounts must match — When Wings spawns containers, bind mounts use host paths. If your docker-compose uses relative paths inside the container, the spawned containers won\u0026rsquo;t find them.\nDocker on TrueNAS works well — The network_mode: host requirement for Wings is handled fine, and having the volumes on ZFS gives me snapshot capabilities for free.\nDNS is always the problem — When in doubt, add more upstream DNS servers. Kubernetes pods relying on node DNS resolution is a foot-gun.\nWildcard certs are worth it — Having *.nerdz.cloud available meant I could just export and use it rather than setting up another Let\u0026rsquo;s Encrypt flow.\nSplit the control plane from the data plane — Panel in Kubernetes (HA, GitOps), Wings on TrueNAS (storage, compute). Best of both worlds.\nRead the logs — Both Pterodactyl and Wings have excellent logging. 
Every problem I hit was clearly explained in the logs once I actually looked.\nWhat\u0026rsquo;s Next Backup schedules — Configure automatic backups to the MinIO bucket User management — Create accounts so the kids can manage their own servers More port forwards — Different games need different ports (Valheim wants 2456-2457, Satisfactory wants 7777, etc.) Monitoring — Add Prometheus metrics for game server health Resources Pterodactyl Documentation Wings Docker Setup Pelican Eggs Repository My home-ops repo — Full GitOps setup including Pterodactyl manifests ","date":"2026-01-10T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2026/pterodactyl-truenas-setup/","title":"Running Game Servers from a NAS: Pterodactyl + TrueNAS"},{"content":" \u0026ldquo;Your backups are only as good as your last successful restore.\u0026rdquo;\nThe Discovery It started with qbittorrent refusing to authenticate. After the Ceph Reef to Tentacle upgrade, several apps needed restoring from backups. Routine stuff—trigger the VolSync ReplicationDestination, wait for completion, scale up the app.\nExcept the restored data was garbage.\n1 2 3 $ kubectl exec -n downloads deploy/qbittorrent -c app -- cat /config/qBittorrent/qBittorrent.conf [Preferences] WebUI\\Username=^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@... That\u0026rsquo;s not a username. That\u0026rsquo;s null bytes. The entire config file was zeroed out—the file existed, had the right size, but contained nothing but 0x00 characters.\nThe Pattern Emerges Checking other apps revealed the same problem:\n1 2 3 4 5 6 7 8 9 10 11 # Sabnzbd - entire config gone $ kubectl exec -n downloads deploy/sabnzbd -- ls -la /config/ total 4 drwxr-xr-x 3 apps apps 22 Dec 22 10:15 . drwxr-xr-x 1 root root 4096 Dec 22 10:15 .. 
drwx------ 2 apps apps 6 Dec 22 10:15 lost+found # Radarr - config.xml zeroed $ kubectl exec -n downloads deploy/radarr -- head -c 50 /config/config.xml | xxd 00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................ 00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................ The common factor: all these PVCs were on ceph-filesystem storage class and had been restored via VolSync.\nUnderstanding the Bug CephFS handles sparse files differently than traditional filesystems. A sparse file is one where regions of null bytes aren\u0026rsquo;t actually stored on disk—they\u0026rsquo;re just metadata saying \u0026ldquo;this region is empty.\u0026rdquo;\nThe problem: when VolSync\u0026rsquo;s Kopia mover restores files to CephFS, something in the sparse file handling chain goes wrong. Files that should contain data get their content replaced with null bytes, while maintaining their original size and metadata.\nThis isn\u0026rsquo;t a VolSync bug or a Kopia bug. It\u0026rsquo;s a quirk of how CephFS handles certain write patterns during restore operations. The same restore to ceph-block storage works perfectly.\nThe Damage Assessment After checking all apps that used ceph-filesystem with VolSync backups:\nApp Status Impact qbittorrent Config zeroed Lost WebUI credentials, port settings sabnzbd Empty directory Lost entire config, server settings sonarr Config zeroed Minimal (uses PostgreSQL for data) sonarr-uhd Config zeroed Minimal (uses PostgreSQL for data) sonarr-foreign Config zeroed Minimal (uses PostgreSQL for data) radarr Config zeroed Minimal (uses PostgreSQL for data) radarr-uhd Config zeroed Minimal (uses PostgreSQL for data) filebrowser Config zeroed Lost user settings The sonarr and radarr instances were lucky—they store actual data in PostgreSQL, so the zeroed config.xml only meant losing some network settings. 
But qbittorrent and sabnzbd were serious losses.\nRecovery Strategy The immediate fix was obvious: stop using ceph-filesystem for VolSync-backed PVCs. But first, I needed to recover the data.\nAttempt 1: Kopia Snapshots with previous: N Kopia stores multiple snapshots. The previous parameter tells the ReplicationDestination to restore an older snapshot:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 apiVersion: volsync.backube/v1alpha1 kind: ReplicationDestination metadata: name: sabnzbd-test-restore namespace: downloads spec: trigger: manual: test-restore-1 kopia: repository: sabnzbd-volsync-secret destinationPVC: sabnzbd-test copyMethod: Snapshot snapshotClassName: csi-ceph-block storageClassName: ceph-block # Not ceph-filesystem! previous: 3 # Go back 3 snapshots I tried previous: 3, previous: 7, previous: 10, even previous: 13. Every single snapshot was empty.\nThe CephFS corruption happened before the Kopia migration. All Kopia snapshots were backing up already-corrupted data.\nAttempt 2: Kopia with restoreAsOf Maybe the corruption was more recent? Kopia\u0026rsquo;s restoreAsOf parameter restores from the most recent snapshot before a given timestamp:\n1 2 3 spec: kopia: restoreAsOf: \u0026#34;2025-12-10T23:59:59Z\u0026#34; # Day before Kopia migration Same result. Empty. The corruption predated any Kopia backup.\nAttempt 3: Old Restic Backups Before migrating to Kopia on December 11th, I had Restic backups going to Backblaze B2. Those old backups might still have good data.\nThe Restic backup bucket (nerdz-volsync) was separate from the Kopia bucket (nerdz-volsync-kopia). 
I still had the credentials in 1Password.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 apiVersion: volsync.backube/v1alpha1 kind: ReplicationDestination metadata: name: sabnzbd-restic-restore namespace: downloads spec: trigger: manual: restic-restore restic: repository: sabnzbd-volsync-restic-secret destinationPVC: sabnzbd-test copyMethod: Direct storageClassName: ceph-block restoreAsOf: \u0026#34;2025-12-10T23:59:59Z\u0026#34; moverSecurityContext: runAsUser: 568 runAsGroup: 568 fsGroup: 568 1 2 3 4 5 6 7 8 9 $ kubectl exec debug-pod -- ls -la /mnt/sabnzbd-test/ drwxr-xr-x 5 apps apps 101 Dec 10 03:15 . -rw-r--r-- 1 apps apps 8234 Dec 10 03:15 sabnzbd.ini drwxr-xr-x 2 apps apps 45 Dec 9 12:30 admin $ kubectl exec debug-pod -- grep -A2 \u0026#34;\\[servers\\]\u0026#34; /mnt/sabnzbd-test/sabnzbd.ini [servers] [[Frugal EU]] host = reader.frugalusenet.com Success! The December 10th Restic backup had the full config with all my Usenet server settings.\nThe Recovery Process Step 1: Create the Restic Restore Component I created a one-time-use component specifically for Restic restores:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 # kubernetes/components/volsync-restic-restore/replicationdestination.yaml apiVersion: volsync.backube/v1alpha1 kind: ReplicationDestination metadata: name: \u0026#34;${APP}-restic-dst\u0026#34; spec: trigger: manual: restore-once restic: repository: ${APP}-volsync-restic-secret destinationPVC: ${APP} copyMethod: Direct storageClassName: ${VOLSYNC_STORAGECLASS:=ceph-block} restoreAsOf: \u0026#34;${RESTIC_RESTORE_AS_OF:=2025-12-10T23:59:59Z}\u0026#34; Step 2: Migrate Each App to ceph-block For each affected app:\nScale down the deployment Delete the corrupted PVC Create new PVC on ceph-block Restore from Restic backup Update ks.yaml to use ceph-block going forward Scale up and verify 1 2 3 4 5 6 7 8 9 # Example for sabnzbd flux suspend kustomization sabnzbd -n downloads kubectl scale deploy sabnzbd -n downloads --replicas=0 kubectl delete pvc sabnzbd -n 
downloads # Apply the restic restore component # Wait for ReplicationDestination to complete flux resume kustomization sabnzbd -n downloads Step 3: Verify and Create Fresh Backups After confirming each app had valid data, I triggered fresh backups to all three destinations:\n1 2 3 4 5 6 7 8 9 10 11 # NFS backup kubectl patch replicationsource sabnzbd -n downloads --type=merge \\ -p \u0026#39;{\u0026#34;spec\u0026#34;:{\u0026#34;trigger\u0026#34;:{\u0026#34;manual\u0026#34;:\u0026#34;fresh-backup-nfs\u0026#34;}}}\u0026#39; # Backblaze B2 backup kubectl patch replicationsource sabnzbd-b2 -n downloads --type=merge \\ -p \u0026#39;{\u0026#34;spec\u0026#34;:{\u0026#34;trigger\u0026#34;:{\u0026#34;manual\u0026#34;:\u0026#34;fresh-backup-b2\u0026#34;}}}\u0026#39; # Cloudflare R2 backup kubectl patch replicationsource sabnzbd-r2 -n downloads --type=merge \\ -p \u0026#39;{\u0026#34;spec\u0026#34;:{\u0026#34;trigger\u0026#34;:{\u0026#34;manual\u0026#34;:\u0026#34;fresh-backup-r2\u0026#34;}}}\u0026#39; The Flux Alert Spam After fixing all the apps, I got bombarded with Flux alerts:\n1 2 PersistentVolumeClaim/downloads/sonarr-foreign dry-run failed (Invalid): PersistentVolumeClaim \u0026#39;sonarr-foreign\u0026#39; is invalid: spec: Forbidden: spec is immutable The volsync component\u0026rsquo;s PVC template includes a dataSourceRef pointing to the ReplicationDestination. 
For existing PVCs, this causes a conflict—you can\u0026rsquo;t add a dataSourceRef after creation.\nThe fix was adding the IfNotPresent SSA label to the PVC template:\n1 2 3 4 5 6 7 8 9 10 11 12 # kubernetes/components/volsync/nfs-truenas/pvc.yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: ${APP} labels: kustomize.toolkit.fluxcd.io/ssa: IfNotPresent # Don\u0026#39;t update if exists spec: dataSourceRef: kind: ReplicationDestination apiGroup: volsync.backube name: ${APP}-dst This tells Flux: \u0026ldquo;Create this PVC if it doesn\u0026rsquo;t exist, but don\u0026rsquo;t try to update existing ones.\u0026rdquo;\nLessons Learned Assumption Reality CephFS works fine for all workloads Sparse file handling during restores can corrupt data Kopia backups are good if they complete They can back up already-corrupted data perfectly previous: N is a time machine Only if the data was good when backed up Old backup systems can be deleted after migration Keep them until you\u0026rsquo;ve verified restores work All my apps use PostgreSQL for data qbittorrent and sabnzbd use local config files The 3-2-1-1 Backup Strategy After this incident, I\u0026rsquo;ve upgraded from 3-2-1 to 3-2-1-1:\n3 copies of data 2 different storage types 1 offsite copy 1 air-gapped or delayed-deletion copy The old Restic backups in B2 were essentially an air-gapped backup—I hadn\u0026rsquo;t deleted them after the Kopia migration. That laziness saved my data.\nStorage Class Selection Going forward, all VolSync-backed PVCs use ceph-block:\nUse Case Storage Class App config/data backed by VolSync ceph-block Shared working storage (media processing) ceph-filesystem Databases (backed by pgBackRest) ceph-block Temporary/cache data openebs-hostpath CephFS is still useful for ReadWriteMany workloads where multiple pods need access to the same files. 
Just don\u0026rsquo;t use it for data that needs to survive restore operations.\nUpdate 2025-12-23: CSI Read Affinity After discussing this issue with the home-operations community, I discovered another contributing factor: CSI Read Affinity.\n1 2 3 4 5 # kubernetes/apps/rook-ceph/rook-ceph/cluster/helmrelease.yaml cephClusterSpec: csi: readAffinity: enabled: true # THIS CAUSES PROBLEMS This setting makes the Ceph CSI driver prefer reading from OSDs on the same node as the pod. While this sounds like a performance optimization, it can cause data consistency issues with CephFS—particularly with sparse file handling during restore operations.\nThe fix: Disable it.\n1 2 3 csi: readAffinity: enabled: false If you\u0026rsquo;re experiencing CephFS data corruption, check this setting first.\nQuick Reference 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 # Check for sparse file corruption kubectl exec -n \u0026lt;namespace\u0026gt; deploy/\u0026lt;app\u0026gt; -- od -c /config/config.xml | head -5 # If you see \u0026#34;0000000 \\0 \\0 \\0 \\0...\u0026#34; - it\u0026#39;s zeroed # Restore from old Restic backup # 1. Create secret with old Restic credentials kubectl create secret generic ${APP}-volsync-restic-secret \\ --from-literal=RESTIC_REPOSITORY=s3:s3.us-east-005.backblazeb2.com/nerdz-volsync/${APP} \\ --from-literal=RESTIC_PASSWORD=\u0026lt;password\u0026gt; \\ --from-literal=AWS_ACCESS_KEY_ID=\u0026lt;key\u0026gt; \\ --from-literal=AWS_SECRET_ACCESS_KEY=\u0026lt;secret\u0026gt; \\ -n \u0026lt;namespace\u0026gt; # 2. Create ReplicationDestination with restoreAsOf # 3. Trigger restore with: kubectl patch ... 
manual: restore-now # Force fresh backup to all destinations for suffix in \u0026#34;\u0026#34; \u0026#34;-b2\u0026#34; \u0026#34;-r2\u0026#34;; do kubectl patch replicationsource ${APP}${suffix} -n \u0026lt;namespace\u0026gt; --type=merge \\ -p \u0026#39;{\u0026#34;spec\u0026#34;:{\u0026#34;trigger\u0026#34;:{\u0026#34;manual\u0026#34;:\u0026#34;fresh-\u0026#39;$(date +%s)\u0026#39;\u0026#34;}}}\u0026#39; done # Check backup status kubectl get replicationsource -n \u0026lt;namespace\u0026gt; Final Thoughts Data corruption is insidious. The files looked normal—right names, right sizes, right permissions. Only the content was wrong. Without actually reading the files, there was no indication anything was broken.\nThis is why backup verification matters. Not \u0026ldquo;did the backup job complete successfully,\u0026rdquo; but \u0026ldquo;can I actually restore and use the data.\u0026rdquo; I\u0026rsquo;ve added a monthly calendar reminder to do test restores.\nThe silver lining: this forced me to audit all my apps and migrate everything to consistent storage classes. The cluster is more robust now than before the incident.\nThis post documents the recovery from my home-ops cluster. The original VolSync Kopia migration is documented in my previous post.\n","date":"2025-12-22T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/cephfs-sparse-file-corruption/","title":"CephFS Sparse File Corruption: A Data Recovery Story"},{"content":" \u0026ldquo;You can\u0026rsquo;t skip Ceph versions. But you can be wrong about Rook constraints.\u0026rdquo;\nThe Situation Reef (v18) end-of-life is August 2025. Tentacle (v20) shipped in November 2025. Time to upgrade.\nI initially wrote a combined \u0026ldquo;Reef to Tentacle\u0026rdquo; upgrade guide, planning to do both hops in sequence. 
Then I read the Rook compatibility matrix and concluded I needed to wait for Rook v1.19 to get Tentacle support.\nI was wrong.\nThe Rook Constraint (Corrected) Here\u0026rsquo;s what I originally thought:\nRook Version Supported Ceph Versions v1.17.x Reef only v1.18.x Reef + Squid v1.19+ Squid + Tentacle (drops Reef) The reality: Rook v1.18.8 added Tentacle support. You don\u0026rsquo;t need to wait for v1.19.\nFrom this blog post:\n\u0026ldquo;If you\u0026rsquo;re running Ceph within Kubernetes using Rook Ceph, and you want to use Tentacle without the unsupported flag, you need to update at least to version v1.18.8.\u0026rdquo;\nSo the actual support matrix is:\nRook Version Supported Ceph Versions v1.18.0 - v1.18.7 Reef + Squid v1.18.8+ Reef + Squid + Tentacle The upgrade path still requires going through Squid—you can\u0026rsquo;t skip versions—but you can do both hops on Rook v1.18.8.\nMy Starting Point $ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph version ceph version 18.2.7 (2cf3b0098dc3cbb1b6f2e8d8ed9df8c65b6aee53) reef (stable) $ kubectl -n rook-ceph get deploy rook-ceph-operator -o jsonpath=\u0026#39;{.spec.template.spec.containers[0].image}\u0026#39; ghcr.io/rook/ceph:v1.18.8 Reef v18.2.7 on Rook v1.18.8. Good to go for the full journey.\nPhase 1: Reef to Squid Step 1: Backup Everything Ceph upgrades are one-way. 
Once you run require-osd-release squid, there\u0026rsquo;s no going back to Reef.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 BACKUP_DIR=~/backups/ceph/migration-$(date +%Y%m%d) mkdir -p $BACKUP_DIR # Cluster state kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status \u0026gt; $BACKUP_DIR/ceph-status.txt kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree \u0026gt; $BACKUP_DIR/osd-tree.txt kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd crush dump \u0026gt; $BACKUP_DIR/crush-map.json kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph config dump \u0026gt; $BACKUP_DIR/config-dump.txt kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions \u0026gt; $BACKUP_DIR/versions.txt # Kubernetes resources kubectl -n rook-ceph get cephcluster -o yaml \u0026gt; $BACKUP_DIR/cephcluster.yaml kubectl get pods -n rook-ceph -o wide \u0026gt; $BACKUP_DIR/pods.txt kubectl get pvc -A | grep -E \u0026#34;ceph-block|ceph-filesystem\u0026#34; \u0026gt; $BACKUP_DIR/pvcs.txt Step 2: Set Safety Flags These prevent Ceph from marking OSDs as \u0026ldquo;out\u0026rdquo; and rebalancing data during the rolling restart:\n1 2 3 4 5 6 kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set noout kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set norebalance # Verify kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd dump | grep flags # flags noout,norebalance Without these flags, Ceph gets nervous when daemons restart and starts shuffling data around. That slows down the upgrade and adds risk.\nStep 3: Update the HelmRelease The actual change is one line:\n1 2 3 4 5 # kubernetes/apps/rook-ceph/rook-ceph/cluster/helmrelease.yaml cephClusterSpec: cephVersion: image: quay.io/ceph/ceph:v19.2.3-20250717 # Was v18.2.7 allowUnsupported: false Note the build-specific tag (v19.2.3-20250717). 
Don\u0026rsquo;t use just v19.2.3—the build suffix ensures you get a specific, tested image rather than whatever \u0026ldquo;latest v19.2.3\u0026rdquo; happens to be.\nStep 4: Deploy via GitOps git add kubernetes/apps/rook-ceph/rook-ceph/cluster/helmrelease.yaml git commit -m \u0026#34;feat(rook-ceph): upgrade Ceph from Reef v18.2.7 to Squid v19.2.3\u0026#34; git push Flux picks up the change and triggers Rook to start the rolling upgrade.\nStep 5: Watch the Rolling Upgrade Rook upgrades daemons in a specific order: MON → MGR → MDS → OSD → RGW.\n# Terminal 1: Watch pods kubectl -n rook-ceph get pods -w # Terminal 2: Watch Ceph status watch -n 10 \u0026#39;kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s\u0026#39; # Periodically check versions kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions During the upgrade, you\u0026rsquo;ll see brief HEALTH_WARN states as daemons restart. This is normal. Only worry if you see HEALTH_ERR persisting for more than 10 minutes.\nMy 3-node cluster upgraded in about 3 minutes:\n{ \u0026#34;mon\u0026#34;: { \u0026#34;ceph version 19.2.3 (...) squid (stable)\u0026#34;: 3 }, \u0026#34;mgr\u0026#34;: { \u0026#34;ceph version 19.2.3 (...) squid (stable)\u0026#34;: 2 }, \u0026#34;osd\u0026#34;: { \u0026#34;ceph version 19.2.3 (...) squid (stable)\u0026#34;: 3 }, \u0026#34;mds\u0026#34;: { \u0026#34;ceph version 19.2.3 (...) squid (stable)\u0026#34;: 2 }, \u0026#34;rgw\u0026#34;: { \u0026#34;ceph version 19.2.3 (...) squid (stable)\u0026#34;: 2 } } Step 6: Finalize the Squid Upgrade This step is critical. 
It tells Ceph that all OSDs are now Squid-capable, enabling Squid-specific features:\n1 2 3 4 5 6 7 8 9 10 # Check current state kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd dump | grep require_osd_release # require_osd_release reef # Set Squid requirement (NO GOING BACK AFTER THIS) kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd require-osd-release squid # Verify kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd dump | grep require_osd_release # require_osd_release squid Step 7: Unset Safety Flags 1 2 kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd unset noout kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd unset norebalance Step 8: Verify Health 1 2 3 4 5 $ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail HEALTH_OK $ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg stat 169 pgs: 169 active+clean Phase 1 complete. Cluster is on Squid and healthy.\nPhase 2: Squid to Tentacle After running Squid stably for a while (in my case, a few hours—I was impatient), I proceeded with the Tentacle upgrade.\nTentacle Breaking Changes Before upgrading, I checked the Tentacle release notes for breaking changes:\nChange Impact on My Cluster RGW tenant-level IAM deprecated Not using tenant IAM. No impact. restful mgr module removed Not using REST API module. No impact. zabbix mgr module removed Not using Zabbix. No impact. Erasure coding default changed to ISA-L Only affects new pools. Existing pools unchanged. osd_repair_during_recovery option removed Not in my config. No impact. 
Step 1: Backup Again Same process, new directory:\n1 2 3 4 5 6 7 8 9 10 BACKUP_DIR=~/backups/ceph/squid-to-tentacle-$(date +%Y%m%d) mkdir -p $BACKUP_DIR kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status \u0026gt; $BACKUP_DIR/ceph-status.txt kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree \u0026gt; $BACKUP_DIR/osd-tree.txt kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd crush dump \u0026gt; $BACKUP_DIR/crush-map.json kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph config dump \u0026gt; $BACKUP_DIR/config-dump.txt kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions \u0026gt; $BACKUP_DIR/versions.txt kubectl -n rook-ceph get cephcluster -o yaml \u0026gt; $BACKUP_DIR/cephcluster.yaml kubectl get pods -n rook-ceph -o wide \u0026gt; $BACKUP_DIR/pods.txt Step 2: Set Safety Flags 1 2 kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set noout kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set norebalance Step 3: Update HelmRelease to Tentacle 1 2 3 4 5 # kubernetes/apps/rook-ceph/rook-ceph/cluster/helmrelease.yaml cephClusterSpec: cephVersion: image: quay.io/ceph/ceph:v20.2.0-20251104 # Was v19.2.3-20250717 allowUnsupported: false Step 4: Deploy via GitOps 1 2 3 git add kubernetes/apps/rook-ceph/rook-ceph/cluster/helmrelease.yaml git commit -m \u0026#34;feat(rook-ceph): upgrade Ceph from Squid v19.2.3 to Tentacle v20.2.0\u0026#34; git push Step 5: Watch the Rolling Upgrade Same monitoring as before:\n1 2 3 kubectl -n rook-ceph get pods -w watch -n 10 \u0026#39;kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s\u0026#39; kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph versions The upgrade took about 3-4 minutes. I watched the versions transition:\n1 2 3 4 5 MONs: 2/3 Tentacle... 3/3 Tentacle ✓ MGRs: 0/2 Tentacle... 2/2 Tentacle ✓ MDS: 0/2 Tentacle... 2/2 Tentacle ✓ OSDs: 0/3 Tentacle... 1/3... 2/3... 3/3 Tentacle ✓ RGWs: 0/2 Tentacle... 1/2... 
2/2 Tentacle ✓ Final state:\n1 2 3 4 5 6 7 { \u0026#34;mon\u0026#34;: { \u0026#34;ceph version 20.2.0 (...) tentacle (stable)\u0026#34;: 3 }, \u0026#34;mgr\u0026#34;: { \u0026#34;ceph version 20.2.0 (...) tentacle (stable)\u0026#34;: 2 }, \u0026#34;osd\u0026#34;: { \u0026#34;ceph version 20.2.0 (...) tentacle (stable)\u0026#34;: 3 }, \u0026#34;mds\u0026#34;: { \u0026#34;ceph version 20.2.0 (...) tentacle (stable)\u0026#34;: 2 }, \u0026#34;rgw\u0026#34;: { \u0026#34;ceph version 20.2.0 (...) tentacle (stable)\u0026#34;: 2 } } Step 6: Finalize the Tentacle Upgrade Interestingly, this was already set automatically:\n1 2 $ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd dump | grep require_osd_release require_osd_release tentacle Rook must have set it after detecting all OSDs were on Tentacle. If yours isn\u0026rsquo;t set:\n1 kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd require-osd-release tentacle Step 7: Unset Safety Flags 1 2 kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd unset noout kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd unset norebalance Step 8: Final Health Check 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 $ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail HEALTH_OK $ kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s cluster: id: 3b3d504b-96b4-4102-9ce3-c91b6bc2948d health: HEALTH_OK services: mon: 3 daemons, quorum a,b,c (age 6m) mgr: b(active, since 5m), standbys: a mds: 1/1 daemons up, 1 hot standby osd: 3 osds: 3 up (since 2m), 3 in (since 4w) rgw: 2 daemons active (2 hosts, 1 zones) data: volumes: 1/1 healthy pools: 12 pools, 169 pgs objects: 55.48k objects, 104 GiB usage: 305 GiB used, 4.9 TiB / 5.2 TiB avail pgs: 169 active+clean Done. Reef → Squid → Tentacle complete.\nWhat About the Toolbox? One thing that caught me off guard during Reef → Squid: the Ceph toolbox doesn\u0026rsquo;t auto-upgrade with the cluster. 
It\u0026rsquo;s a separate deployment.\nTurns out Rook does update it automatically now (at least in v1.18.8). But if yours is still on an old version:\n1 kubectl -n rook-ceph set image deploy/rook-ceph-tools rook-ceph-tools=quay.io/ceph/ceph:v20.2.0-20251104 Using mismatched toolbox and cluster versions can cause confusing behavior—the CLI tools might not understand newer cluster features.\nLessons Learned Assumption Reality Can upgrade Reef → Tentacle directly Must go Reef → Squid → Tentacle, one version at a time Need Rook v1.19 for Tentacle Rook v1.18.8 already supports Tentacle Toolbox auto-upgrades with cluster It does now, but verify anyway Generic version tags are fine Use build-specific tags (e.g., v20.2.0-20251104) for reproducibility require_osd_release needs manual setting Rook set it automatically for Tentacle (but verify) Quick Reference 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 # Check current versions kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph version kubectl -n rook-ceph get deploy rook-ceph-operator -o jsonpath=\u0026#39;{.spec.template.spec.containers[0].image}\u0026#39; # Set/unset safety flags kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set noout kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set norebalance kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd unset noout kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd unset norebalance # Finalize upgrade (NO UNDO) kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd require-osd-release tentacle # Check health kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg stat References Rook Ceph Upgrade Guide Ceph Squid Release Notes Ceph Tentacle Release Notes Quay.io Ceph Tags Tentacle + Rook v1.18.8 Blog Post The upgrade guides are in my home-ops repository under 
docs/Guides/Storage/.\n","date":"2025-12-20T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/ceph-reef-to-tentacle-upgrade/","title":"Upgrading Ceph from Reef to Tentacle in a Rook-Managed Cluster"},{"content":" \u0026ldquo;I thought BGP was supposed to solve this hairpin nonsense I had with L2?\u0026rdquo; — Me, staring at timeout errors\nThe Setup Last week I migrated from Cilium L2 announcements to BGP for LoadBalancer IP advertisement. The migration went smoothly — BGP sessions established, routes advertised, services reachable. I patted myself on the back for solving the hairpin routing issues that plagued L2 announcements.\nThen qui stopped working.\nThe Symptom The qui pod (a qBittorrent web UI) was failing its startup probe:\n1 2 3 Warning Unhealthy 4s (x14 over 70s) kubelet Startup probe failed: Get \u0026#34;http://10.69.0.144:7476/health\u0026#34;: context deadline exceeded (Client.Timeout exceeded while awaiting headers) Looking at the pod logs told the real story:\n1 2 3 {\u0026#34;level\u0026#34;:\u0026#34;warn\u0026#34;,\u0026#34;error\u0026#34;:\u0026#34;Get \\\u0026#34;https://id.nerdz.cloud/.well-known/openid-configuration\\\u0026#34;: dial tcp 10.99.8.202:443: i/o timeout\u0026#34;,\u0026#34;attempt\u0026#34;:1,\u0026#34;issuer\u0026#34;:\u0026#34;https://id.nerdz.cloud\u0026#34;, \u0026#34;message\u0026#34;:\u0026#34;failed to initialize OIDC provider candidate\u0026#34;} The app was timing out trying to reach my OIDC provider at id.nerdz.cloud. 
The health endpoint wasn\u0026rsquo;t responding because the app was stuck waiting for OIDC initialization.\nThe Investigation My first instinct: test connectivity from the pod.\n1 2 kubectl exec -n downloads qui-79cf57dcb-5qss8 -- wget -qO- --timeout=5 \\ https://id.nerdz.cloud/.well-known/openid-configuration Timeout.\nBut wait — from a debug pod on a different node:\n1 2 3 kubectl run debug --image=busybox --restart=Never -- sleep 300 kubectl exec debug -- wget -qO- --timeout=5 \\ https://id.nerdz.cloud/.well-known/openid-configuration Success. Full JSON response.\nThe pattern emerged: pods on stanton-01 couldn\u0026rsquo;t reach 10.99.8.202 (the LoadBalancer VIP for my internal gateway), but pods on stanton-02 and stanton-03 could.\nWhat was special about stanton-01? Let\u0026rsquo;s check:\n1 kubectl get pods -n network -l gateway.envoyproxy.io/owning-gateway-name=internal -o wide 1 2 NAME NODE envoy-network-internal-f0b82637-c98c4cbd8-c8mjh stanton-01 The envoy-internal pod — the backend for 10.99.8.202 — was running on stanton-01. Same node as the failing qui pod.\nThe Root Cause: DSR and Same-Node Hairpin Cilium\u0026rsquo;s DSR (Direct Server Return) mode is great for external traffic:\nClient sends packet to LoadBalancer VIP Router (UDM Pro) forwards to a cluster node via BGP Cilium DNATs to the backend pod Backend sends response directly back to client (skipping the load balancer) This preserves client IPs and reduces latency. But it breaks when the source and destination are on the same node.\nHere\u0026rsquo;s what happens when a pod on stanton-01 tries to reach 10.99.8.202:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 qui pod (stanton-01) │ │ 1. Send packet to 10.99.8.202 ▼ Cilium BPF (stanton-01) │ │ 2. \u0026#34;10.99.8.202 is an ExternalIP, not in my routing table\u0026#34; │ Route to default gateway (UDM Pro) ▼ UDM Pro │ │ 3. BGP route says 10.99.8.202 → stanton-01 │ Send back to stanton-01 ▼ ??? │ │ 4. 
Packet bounces or gets dropped ▼ Timeout The problem is that Cilium\u0026rsquo;s DSR mode doesn\u0026rsquo;t intercept traffic to LoadBalancer VIPs from pods — it\u0026rsquo;s designed for external traffic entering the cluster. The packet goes out to the router, the router sends it back, and something breaks in the return path.\nThis is a known Cilium limitation. The GitHub issue title says it all: \u0026ldquo;Pods are not able to reach Cilium-managed LoadBalancer IP.\u0026rdquo;\nWhat I Tried (And What Didn\u0026rsquo;t Work) Attempt 1: Socket LB The first suggestion in the docs: enable socket-level load balancing.\n1 2 3 # helm-values.yaml socketLB: enabled: true Socket LB intercepts connections at the connect() syscall, before packets hit the network. In theory, it should handle hairpin traffic. In practice:\n1 2 kubectl exec debug-fresh -- curl --connect-timeout 5 http://10.99.8.202:80 # Connection timed out after 5002 milliseconds Socket LB helps with ClusterIP services, but it doesn\u0026rsquo;t intercept ExternalIP/LoadBalancer traffic.\nAttempt 2: Hybrid Mode Maybe the problem is pure DSR. What about hybrid mode?\n1 2 loadBalancer: mode: hybrid # DSR for external, SNAT for internal Nope. Still timing out.\nAttempt 3: lbExternalClusterIP There\u0026rsquo;s an option to allow cluster-internal access to external LB IPs:\n1 2 loadBalancer: lbExternalClusterIP: true Also didn\u0026rsquo;t work.\nAt this point I\u0026rsquo;d tried everything in the Cilium docs. The problem is fundamental to how BGP + native routing + DSR interact. Pods simply can\u0026rsquo;t reach LoadBalancer VIPs via the normal packet path when the backend is on the same node.\nThe Solution: CoreDNS Rewriting If I can\u0026rsquo;t fix the network path, I can fix the DNS. 
The key insight: pods don\u0026rsquo;t need to hit the LoadBalancer VIP — they can use the ClusterIP instead.\nClient Should resolve to External (internet) LoadBalancer VIP 10.99.8.202 Pod (internal) ClusterIP 10.96.9.253 CoreDNS can make this happen with the template plugin:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 # kubernetes/apps/kube-system/coredns/app/helm-values.yaml servers: # Internal gateway rewrite - resolves internal services to ClusterIP # This works around Cilium DSR hairpin limitation - zones: - zone: ${SECRET_DOMAIN} scheme: dns:// port: 53 plugins: - name: errors - name: log configBlock: |- class error - name: template parameters: IN A configBlock: |- match (^internal\\.|^id\\.)nerdz\\.cloud\\.$ answer \u0026#34;{{ .Name }} 60 IN A ${ENVOY_INTERNAL_CLUSTERIP}\u0026#34; fallthrough - name: template parameters: IN AAAA configBlock: |- match (^internal\\.|^id\\.)nerdz\\.cloud\\.$ rcode NOERROR fallthrough - name: forward parameters: . /etc/resolv.conf - name: cache parameters: 30 The ENVOY_INTERNAL_CLUSTERIP variable is defined in cluster-settings.yaml and substituted by Flux. This way if the ClusterIP ever changes (e.g., if the service is recreated), you only need to update one place.\nThe template matches queries for internal.nerdz.cloud and id.nerdz.cloud and returns the ClusterIP instead of forwarding to external DNS (which would return the LoadBalancer VIP).\nWhy Both Domains? My DNS is set up with a CNAME:\n1 id.nerdz.cloud → internal.nerdz.cloud → 10.99.8.202 (LB VIP) If I only intercept internal.nerdz.cloud, the CNAME lookup for id.nerdz.cloud goes to external DNS, which returns the CNAME, and then the A record lookup for internal.nerdz.cloud also goes to external DNS. 
The whole chain bypasses my template.\nBy intercepting both domains, I catch the query before it ever leaves the cluster.\nVerification After deploying the CoreDNS change:\n# From a pod on stanton-01\nkubectl exec debug-dns -- nslookup id.nerdz.cloud\nServer: 10.96.0.10\nAddress: 10.96.0.10#53\nName: id.nerdz.cloud\nAddress: 10.96.9.253 # ClusterIP, not LB VIP!\nAnd the OIDC endpoint:\nkubectl exec debug-dns -- curl -s https://id.nerdz.cloud/.well-known/openid-configuration | head -c 100\n{\u0026#34;authorization_endpoint\u0026#34;:\u0026#34;https://id.nerdz.cloud/authorize\u0026#34;,\u0026#34;authorization_response_iss_parameter_supported\u0026#34;:true...\nThe qui pod now starts without timing out:\nkubectl get pods -n downloads -l app.kubernetes.io/name=qui\nNAME READY STATUS RESTARTS AGE\nqui-79cf57dcb-hn6dj 1/1 Running 0 2m\nThe Trade-off This solution has a subtle implication: internal pod traffic to id.nerdz.cloud won\u0026rsquo;t hit the LoadBalancer layer. It goes directly to the envoy-internal ClusterIP.\nIn practice, this doesn\u0026rsquo;t matter — ClusterIP load balancing is still handled by Cilium, and there\u0026rsquo;s only one replica of envoy-internal anyway. But if you\u0026rsquo;re doing something fancy with LoadBalancer-level policies or metrics, be aware that pod traffic bypasses that layer.\nLessons Learned BGP fixes L2 hairpin, not DSR hairpin — L2 announcements had hairpin issues because one node \u0026ldquo;owned\u0026rdquo; the IP via ARP. BGP fixed that by routing through the router. But DSR mode has its own hairpin problem when source and backend share a node.\nClusterIP always works — When in doubt, use ClusterIP for pod-to-service traffic. It\u0026rsquo;s what Kubernetes expects.\nDNS is a powerful escape hatch — When the network layer is fighting you, sometimes the answer is above the network layer. 
CoreDNS templates let you intercept and rewrite queries before they ever become packets.\nTest from the same node — My initial testing from different nodes missed the issue entirely. When debugging network problems, always test from the specific node(s) having issues.\nRead the GitHub issues — The Cilium issue tracker is full of people hitting this exact problem. I could have saved hours by searching for \u0026ldquo;pods cannot reach LoadBalancer IP\u0026rdquo; earlier.\nThe Fix in Context Problem Solution L2 hairpin (ARP ownership) Migrate to BGP DSR same-node hairpin CoreDNS rewrite to ClusterIP General pod-to-LB access Use ClusterIP or service DNS BGP was the right call for L2 hairpin. But DSR mode introduced a new hairpin scenario that BGP doesn\u0026rsquo;t solve. CoreDNS rewriting is the cleanest workaround I\u0026rsquo;ve found — it\u0026rsquo;s targeted, doesn\u0026rsquo;t require Cilium changes, and works regardless of pod scheduling.\nReferences Cilium Issue #39198 — Pods are not able to reach Cilium-managed LoadBalancer IP Cilium DSR Documentation — How DSR mode works CoreDNS Template Plugin — Synthesizing DNS responses From L2 to BGP — The migration that started this journey This post documents a debugging session with help from Claude Code. The CoreDNS workaround is now part of my home-ops repository.\n","date":"2025-12-20T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/cilium-dsr-hairpin-workaround/","title":"When BGP Doesn't Fix Hairpin: Cilium DSR and the Same-Node Problem"},{"content":" \u0026ldquo;The backup that only exists in one place doesn\u0026rsquo;t exist at all.\u0026rdquo;\nThe Problem: Barman\u0026rsquo;s Dirty Secret I had what I thought was a solid PostgreSQL backup strategy for my Immich database. CloudNativePG with Barman Cloud Plugin, two ObjectStores configured—one for Backblaze B2, one for Cloudflare R2. Daily ScheduledBackups for each destination. 
Belt and suspenders.\nThen I checked the actual buckets.\nB2: Full backups, WAL files, everything present. R2: Empty. Nothing. Not a single file.\nBoth ScheduledBackups were writing to B2. The R2 ObjectStore was configured, referenced in the ScheduledBackup—and completely ignored.\nTurns out, the Barman Cloud Plugin has a limitation I hadn\u0026rsquo;t spotted. The barmanObjectName parameter in ScheduledBackup? It\u0026rsquo;s ignored. The plugin only uses whatever\u0026rsquo;s configured in the cluster\u0026rsquo;s plugin configuration. Both my scheduled backups were hitting the same destination.\nThis isn\u0026rsquo;t a theoretical concern. If Backblaze goes down, I can\u0026rsquo;t copy backups to R2 because they\u0026rsquo;re only in B2. My 3-2-1 backup strategy was actually a 2-1-1.\nThe Alternative: pgBackRest After researching alternatives, I found three options for PostgreSQL backups with CloudNativePG plugins:\nBarman (what I was using) - Single destination limitation\nWAL-G - Similar architecture, no multi-repo support\npgBackRest - Native multi-repository support\npgBackRest from Dalibo looked promising. It\u0026rsquo;s designed from the ground up for multi-repository backups—WAL archiving goes to ALL configured repositories simultaneously. The catch: it\u0026rsquo;s experimental, has 16 GitHub stars, and the documentation is thin.\nI gave it a shot anyway.\nDeploying the pgBackRest Plugin Unlike Barman, which has a Helm chart, the Dalibo pgBackRest plugin requires manual deployment. 
I created a new directory structure:\n1 2 3 4 5 6 7 kubernetes/apps/database/cloudnative-pg/pgbackrest/ ├── kustomization.yaml ├── crd.yaml # Repository CRD ├── rbac.yaml # ServiceAccount, ClusterRole, RoleBinding ├── certificate.yaml # Self-signed TLS for plugin communication ├── deployment.yaml # The controller └── service.yaml # Exposes the controller to CNPG The controller deployment is straightforward:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 apiVersion: apps/v1 kind: Deployment metadata: name: pgbackrest-controller namespace: database spec: replicas: 1 selector: matchLabels: app: pgbackrest-controller template: spec: containers: - name: pgbackrest-controller image: registry.hub.docker.com/dalibo/cnpg-pgbackrest-controller:latest args: - operator - --server-cert=/server/tls.crt - --server-key=/server/tls.key - --client-cert=/client/tls.crt - --server-address=:9090 - --log-level=debug env: - name: SIDECAR_IMAGE value: registry.hub.docker.com/dalibo/cnpg-pgbackrest-sidecar:latest volumeMounts: - mountPath: /server name: server - mountPath: /client name: client volumes: - name: server secret: secretName: pgbackrest-controller-server-tls - name: client secret: secretName: pgbackrest-controller-client-tls The Service Name Gotcha My first deployment crashed with a cryptic error:\n1 stanza creation failed: can\u0026#39;t parse pgbackrest JSON: invalid character \u0026#39;P\u0026#39; After way too much debugging, I found the issue. I\u0026rsquo;d named the service pgbackrest. Kubernetes automatically creates environment variables for services: PGBACKREST_SERVICE_HOST, PGBACKREST_PORT, etc.\npgBackRest interprets any PGBACKREST_* environment variable as configuration. The sidecar was trying to parse PGBACKREST_PORT_9090_TCP_ADDR as a pgBackRest option and choking on the JSON output.\nThe fix: rename the service to cnpg-pgbackrest. 
No more environment variable conflicts.\nLeader Lease Conflict After fixing the service name, the controller was stuck:\n1 attempting to acquire leader lease database/822e3f5c.cnpg.io... Both Barman and pgBackRest plugins use the same leader election lease name. I had to disable Barman first:\n1 2 flux suspend helmrelease barman-cloud -n database kubectl scale deployment barman-cloud-plugin-barman-cloud -n database --replicas=0 Once Barman released the lease, pgBackRest acquired it and started working.\nConfiguring Multi-Repository Backups The Repository CR is where the magic happens. You can define multiple S3 repositories and pgBackRest will archive WAL to all of them:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 apiVersion: pgbackrest.dalibo.com/v1 kind: Repository metadata: name: immich18-repository namespace: database spec: repoConfiguration: stanza: immich18 archive: async: true pushQueueMax: 1GiB s3Repositories: # Primary: Backblaze B2 - bucket: ${IMMICH_PG_BACKUP_B2_BUCKET} endpoint: s3.us-east-005.backblazeb2.com region: us-east-005 repoPath: /immich18 retentionPolicy: full: 14 fullType: count secretRef: accessKeyId: name: immich-cnpg-secret key: b2-access-key-id secretAccessKey: name: immich-cnpg-secret key: b2-secret-access-key # Secondary: Cloudflare R2 (use Flux variable substitution) - bucket: ${IMMICH_PG_BACKUP_R2_BUCKET} endpoint: ${CLOUDFLARE_ACCOUNT_ID}.r2.cloudflarestorage.com region: auto repoPath: /immich18 retentionPolicy: full: 14 fullType: count secretRef: accessKeyId: name: immich-cnpg-secret key: r2-access-key-id secretAccessKey: name: immich-cnpg-secret key: r2-secret-access-key Note the R2 bucket and endpoint use Flux variable substitution (${VARIABLE_NAME}) instead of hardcoded values. These get populated from an ExternalSecret that pulls from 1Password, with the Flux Kustomization configured to substitute variables from that secret. 
This keeps sensitive values like Cloudflare account IDs out of git history.\nThe cluster just needs to reference the plugin:\n1 2 3 4 5 spec: plugins: - name: pgbackrest.dalibo.com parameters: repositoryRef: immich18-repository The Backup Target Problem First backup attempt failed:\n1 2 ERROR: [056]: unable to find primary cluster - cannot proceed HINT: are all available clusters in recovery? CNPG defaults to running backups on replicas to reduce load on the primary. But pgBackRest can\u0026rsquo;t run backups from replicas without SSH access to the primary—which doesn\u0026rsquo;t exist in Kubernetes.\nThe fix is simple: tell CNPG to run backups on the primary:\n1 2 3 4 5 6 7 8 9 10 11 12 apiVersion: postgresql.cnpg.io/v1 kind: ScheduledBackup metadata: name: immich18-daily-b2 spec: schedule: \u0026#34;0 3 * * *\u0026#34; target: primary # This is the key method: plugin cluster: name: immich18 pluginConfiguration: name: pgbackrest.dalibo.com The Multi-Repo Full Backup Discovery With the target fixed, backups started working. WAL archiving was going to both repositories—I could see files appearing in both B2 and R2. But when I checked the full backups:\n1 2 3 4 5 6 7 # B2 aws s3 ls s3://\u0026lt;your-bucket\u0026gt;/immich18/ --profile backblaze-b2 --recursive | wc -l 1285 # R2 aws s3 ls s3://\u0026lt;your-bucket\u0026gt;/immich18/ --profile cloudflare-r2 --region auto --recursive | wc -l 11 WAL archives were in both. Full backup was only in B2.\nThis is actually intentional behavior in pgBackRest. From the documentation:\nWAL archiving: Pushes to ALL configured repositories simultaneously Full backups: Only runs against ONE repository (defaults to repo1) The reasoning makes sense—full backups are large and expensive. Doing them twice doubles storage costs. WAL goes everywhere for redundancy.\nBut for disaster recovery, I needed full backups in both locations.\nThe selectedRepository Parameter After digging through the plugin source code, I found the solution. 
The plugin accepts a selectedRepository parameter:\n1 2 3 4 selectedRepo, ok := request.Parameters[\u0026#34;selectedRepository\u0026#34;] if !ok { selectedRepo = \u0026#34;1\u0026#34; // use first repo by default } Note: the parameter is selectedRepository, not repo. The code defaults to repository 1 if not specified.\nTesting it:\n1 2 3 4 5 6 7 8 9 10 11 12 13 apiVersion: postgresql.cnpg.io/v1 kind: Backup metadata: name: pgbackrest-r2-test spec: cluster: name: immich18 method: plugin target: primary pluginConfiguration: name: pgbackrest.dalibo.com parameters: selectedRepository: \u0026#34;2\u0026#34; The logs confirmed it:\n1 {\u0026#34;msg\u0026#34;:\u0026#34;using repo\u0026#34;,\u0026#34;repo\u0026#34;:\u0026#34;PGBACKREST_REPO=2\u0026#34;} After waiting for the backup to complete:\n1 2 aws s3 ls s3://\u0026lt;your-bucket\u0026gt;/immich18/ --profile cloudflare-r2 --region auto --recursive | wc -l 1232 Full backup in R2.\nThe Final Configuration: Two ScheduledBackups To get true dual-destination full backups, I created two ScheduledBackups—one for each repository:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 --- apiVersion: postgresql.cnpg.io/v1 kind: ScheduledBackup metadata: name: immich18-daily-b2 namespace: database spec: schedule: \u0026#34;0 3 * * *\u0026#34; immediate: true backupOwnerReference: self method: plugin target: primary cluster: name: immich18 pluginConfiguration: name: pgbackrest.dalibo.com parameters: selectedRepository: \u0026#34;1\u0026#34; --- apiVersion: postgresql.cnpg.io/v1 kind: ScheduledBackup metadata: name: immich18-daily-r2 namespace: database spec: # Offset by 1 hour to avoid concurrent runs schedule: \u0026#34;0 4 * * *\u0026#34; immediate: false backupOwnerReference: self method: plugin target: primary cluster: name: immich18 pluginConfiguration: name: pgbackrest.dalibo.com parameters: selectedRepository: \u0026#34;2\u0026#34; Key details:\nSchedules are offset by 1 
hour to avoid concurrent backup operations\nimmediate: true only on the first one (don\u0026rsquo;t need two immediate backups)\nBoth target the primary explicitly\nVerifying the Setup After deployment, both buckets showed full backups:\nBucket | Files | Full Backup\nB2 | 1285+ | 20251217-034103F\nR2 | 1232+ | 20251217-035928F\nWAL archiving continues to push to both repositories automatically. The Repository CR status shows the recovery window:\nstatus:\n  recoveryWindow:\n    firstBackup:\n      label: 20251217-034103F\n      type: full\n    lastBackup:\n      label: 20251217-035928F\n      type: full\nLessons Learned\nAssumption | Reality | Fix\nBarman supports multiple ObjectStores per ScheduledBackup | barmanObjectName is ignored | Switch to pgBackRest\nService name pgbackrest is fine | Creates conflicting PGBACKREST_* env vars | Use cnpg-pgbackrest\nPlugins can share leader leases | They use the same lease name | Disable Barman first\nBackups run on any cluster member | pgBackRest needs the primary | Add target: primary\nParameter name is repo | It\u0026rsquo;s selectedRepository | Read the source code\npgBackRest multi-repo means full backups everywhere | WAL yes, full backups no | Create two ScheduledBackups\nIs pgBackRest Worth It? For the specific use case of multi-destination full backups, yes. The Dalibo plugin is experimental but functional. The key advantages over Barman:\nTrue WAL replication: WAL files go to all repositories simultaneously\nExplicit repository selection: You can target specific repos for full backups\nBetter retention policies: Per-repository retention configuration\nThe downsides:\nNo Helm chart (manual deployment required)\nExperimental status (16 stars on GitHub)\nDocumentation is sparse (I had to read source code)\nFor a homelab where I\u0026rsquo;m willing to debug issues, it\u0026rsquo;s the right choice. 
For production, I\u0026rsquo;d wait for the plugin to mature or implement external replication (rclone sync between buckets).\nPart 2: The Full Migration With pgBackRest working for Immich, I decided to go all-in and migrate my entire PostgreSQL infrastructure:\nRename immich18 to postgres18-immich - Clearer naming convention Create postgres18-cluster - New cluster to replace postgres17 (which used Barman) Migrate all 46 databases from postgres17 to postgres18-cluster Shared Credentials Rather than managing separate S3 credentials for each cluster, I consolidated to shared pgBackRest credentials:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 apiVersion: external-secrets.io/v1beta1 kind: ExternalSecret metadata: name: cnpg-secret namespace: database spec: secretStoreRef: kind: ClusterSecretStore name: onepassword-connect target: name: cnpg-secret data: # pgBackRest B2 - shared across all clusters - secretKey: b2-access-key-id remoteRef: key: backblaze property: BACKBLAZE_PGBACKREST_ACCESS_KEY - secretKey: b2-secret-access-key remoteRef: key: backblaze property: BACKBLAZE_PGBACKREST_SECRET_ACCESS_KEY # pgBackRest R2 - shared across all clusters - secretKey: r2-access-key-id remoteRef: key: cloudflare property: CLOUDFLARE_PGBACKREST_ACCESS_KEY - secretKey: r2-secret-access-key remoteRef: key: cloudflare property: CLOUDFLARE_PGBACKREST_SECRET_ACCESS_KEY Each cluster gets its own S3 bucket but shares the same credentials, simplifying secret management.\nCreating postgres18-cluster The new cluster uses standard CloudNativePG PostgreSQL 18 (no VectorChord needed - that\u0026rsquo;s only for Immich):\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 apiVersion: postgresql.cnpg.io/v1 kind: Cluster metadata: name: postgres18-cluster namespace: database spec: instances: 3 imageName: ghcr.io/cloudnative-pg/postgresql:18@sha256:7f374e054e46fdefd64b52904e32362949703a75c05302dca8ffa1eb78d41891 storage: size: 20Gi storageClass: 
openebs-hostpath postgresql: parameters: max_connections: \u0026#34;200\u0026#34; shared_buffers: 256MB wal_keep_size: 2GB plugins: - name: pgbackrest.dalibo.com parameters: repositoryRef: postgres18-cluster-repository bootstrap: initdb: database: postgres owner: postgres With its own Repository CR pointing to nerdz-postgres-cluster bucket:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 apiVersion: pgbackrest.dalibo.com/v1 kind: Repository metadata: name: postgres18-cluster-repository namespace: database spec: repoConfiguration: stanza: postgres18-cluster archive: async: true pushQueueMax: 1GiB s3Repositories: - bucket: nerdz-postgres-cluster endpoint: s3.us-east-005.backblazeb2.com region: us-east-005 repoPath: /postgres18-cluster retentionPolicy: full: 14 fullType: count secretRef: accessKeyId: name: cnpg-secret key: b2-access-key-id secretAccessKey: name: cnpg-secret key: b2-secret-access-key - bucket: nerdz-postgres-cluster endpoint: ${CLOUDFLARE_ACCOUNT_ID}.r2.cloudflarestorage.com region: auto repoPath: /postgres18-cluster retentionPolicy: full: 14 fullType: count secretRef: accessKeyId: name: cnpg-secret key: r2-access-key-id secretAccessKey: name: cnpg-secret key: r2-secret-access-key The Migration: pg_dump Direct Pipe With postgres18-cluster running and healthy, it was time to migrate 46 databases (~1.8GB total) from postgres17.\nStep 1: Scale down all apps\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # Downloads namespace (14 apps) kubectl scale deploy -n downloads autobrr bazarr bazarr-foreign bazarr-uhd \\ dashbrr prowlarr radarr radarr-uhd readarr sabnzbd sonarr sonarr-foreign \\ sonarr-uhd whisparr --replicas=0 # Home namespace kubectl scale deploy -n home atuin linkwarden manyfold paperless --replicas=0 # Home-automation namespace kubectl scale deploy -n home-automation n8n teslamate --replicas=0 # Other namespaces kubectl scale deploy -n games romm --replicas=0 kubectl scale deploy -n 
observability gatus --replicas=0 kubectl scale deploy -n plane plane-admin-wl plane-api-wl plane-beat-worker-wl \\ plane-live-wl plane-space-wl plane-worker-wl --replicas=0 kubectl scale deploy -n security pocket-id --replicas=0 Step 2: Verify no connections\n1 2 3 kubectl exec -n database postgres17-1 -c postgres -- psql -U postgres -c \\ \u0026#34;SELECT datname, usename, client_addr, state FROM pg_stat_activity \\ WHERE datname IS NOT NULL ORDER BY datname;\u0026#34; Only the query connection itself should appear.\nStep 3: Direct pipe migration\n1 2 kubectl exec -n database postgres17-1 -c postgres -- pg_dumpall -U postgres | \\ kubectl exec -i -n database postgres18-cluster-1 -c postgres -- psql -U postgres This pipes the dump directly between clusters - no intermediate file needed. The only \u0026ldquo;errors\u0026rdquo; are expected:\n1 2 ERROR: role \u0026#34;postgres\u0026#34; already exists ERROR: role \u0026#34;streaming_replica\u0026#34; already exists These roles already exist in the target cluster.\nStep 4: Verify the migration\n1 2 3 4 5 6 7 8 9 10 11 12 13 # Check database counts match kubectl exec -n database postgres17-1 -c postgres -- psql -U postgres -c \\ \u0026#34;SELECT COUNT(*) FROM pg_database WHERE datname NOT IN (\u0026#39;template0\u0026#39;, \u0026#39;template1\u0026#39;);\u0026#34; # Result: 46 kubectl exec -n database postgres18-cluster-1 -c postgres -- psql -U postgres -c \\ \u0026#34;SELECT COUNT(*) FROM pg_database WHERE datname NOT IN (\u0026#39;template0\u0026#39;, \u0026#39;template1\u0026#39;);\u0026#34; # Result: 46 # Spot check critical data kubectl exec -n database postgres18-cluster-1 -c postgres -- psql -U postgres -d teslamate -c \\ \u0026#34;SELECT COUNT(*) FROM positions;\u0026#34; # Result: 3,140,499 (matches source) Step 5: Update ExternalSecrets\nAll apps needed their database hostname updated from postgres17-rw.database.svc.cluster.local to postgres18-cluster-rw.database.svc.cluster.local:\n1 2 find kubernetes/apps 
-name \u0026#34;*.yaml\u0026#34; -exec grep -l \u0026#34;postgres17-rw\u0026#34; {} \\; | \\ xargs sed -i \u0026#39;s/postgres17-rw\\.database\\.svc\\.cluster\\.local/postgres18-cluster-rw.database.svc.cluster.local/g\u0026#39; Step 6: Scale apps back up\n1 2 3 4 kubectl scale deploy -n downloads autobrr bazarr bazarr-foreign bazarr-uhd \\ dashbrr prowlarr radarr radarr-uhd readarr sabnzbd sonarr sonarr-foreign \\ sonarr-uhd whisparr --replicas=1 # ... repeat for other namespaces Adding PgBouncer Connection Pooling For better connection management, I added PgBouncer poolers to both clusters:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 apiVersion: postgresql.cnpg.io/v1 kind: Pooler metadata: name: postgres18-cluster-pooler namespace: database spec: cluster: name: postgres18-cluster instances: 2 type: rw pgbouncer: poolMode: session parameters: max_client_conn: \u0026#34;500\u0026#34; default_pool_size: \u0026#34;100\u0026#34; Cleanup With everything migrated and verified:\n1 2 3 4 5 6 7 8 9 10 11 # Suspend the old Kustomization flux suspend kustomization cloudnative-pg-cluster17 -n database # Delete the cluster kubectl delete cluster postgres17 -n database # Remove from git rm -rf kubernetes/apps/database/cloudnative-pg/cluster17/ # Edit ks.yaml to remove the cluster17 Kustomization git add -A \u0026amp;\u0026amp; git commit -m \u0026#34;chore(database): remove postgres17 cluster\u0026#34; git push Final Architecture 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 kubernetes/apps/database/cloudnative-pg/ ├── app/ # CNPG operator + shared secrets ├── pgbackrest/ # pgBackRest controller ├── postgres18-immich/ # Immich cluster (VectorChord) │ ├── postgres18-immich.yaml │ ├── repository.yaml # B2: nerdz-postgres-immich │ ├── scheduledbackup.yaml # B2 @ 03:00, R2 @ 04:00 │ ├── pooler.yaml │ └── service.yaml └── postgres18-cluster/ # Main cluster (all other apps) ├── postgres18-cluster.yaml ├── repository.yaml # B2: nerdz-postgres-cluster ├── scheduledbackup.yaml # B2 @ 03:00, R2 @ 04:00 
├── pooler.yaml └── service.yaml Cluster Purpose Image Backups postgres18-immich Immich only VectorChord PG18 B2 + R2 via pgBackRest postgres18-cluster All other apps (46 DBs) Standard PG18 B2 + R2 via pgBackRest The Complete Lessons Learned Assumption Reality Fix Barman supports multiple ObjectStores per ScheduledBackup barmanObjectName is ignored Switch to pgBackRest Service name pgbackrest is fine Creates conflicting PGBACKREST_* env vars Use cnpg-pgbackrest Plugins can share leader leases They use the same lease name Disable Barman first Backups run on any cluster member pgBackRest needs the primary Add target: primary Parameter name is repo It\u0026rsquo;s selectedRepository Read the source code pgBackRest multi-repo means full backups everywhere WAL yes, full backups no Create two ScheduledBackups pg_dumpall needs intermediate storage Direct pipe works fine pg_dumpall | psql between pods CNPG clusters can scale to 0 Minimum is 1 instance Suspend Kustomization + delete cluster Is It Worth It? Absolutely. The migration took about 2 hours total:\nZero data loss: All 46 databases migrated, with database counts matched and row counts spot-checked No unplanned downtime: apps were stopped only for the cutover itself True dual-destination backups: Both B2 and R2 now have full backups Cleaner architecture: Shared credentials, consistent naming, PgBouncer pooling PostgreSQL 18: Latest version with performance improvements The pgBackRest plugin is experimental, but it works. For a homelab, it\u0026rsquo;s the right choice.\nThis post documents part of the ongoing work on my home-ops repository. The pgBackRest plugin is from Dalibo.\n","date":"2025-12-19T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/pgbackrest-multi-destination-backups/","title":"pgBackRest: Multi-Destination PostgreSQL Backups in CloudNativePG"},{"content":" \u0026ldquo;L2 is fine until it isn\u0026rsquo;t. 
BGP is harder until it isn\u0026rsquo;t.\u0026rdquo; — Every network engineer eventually\nThe Problem: L2 Announcements on a Crowded Subnet My homelab has been running Cilium with L2 announcements for LoadBalancer IPs. It works — Cilium responds to ARP requests on behalf of the service IPs, and traffic flows. Simple.\nThe problem? All my LoadBalancer IPs lived on the same /24 as my nodes (10.90.3.0/24). With nodes, pods, services, and management devices all sharing airspace, I was running out of room. And L2 announcements have limitations:\nSubnet constraint: IPs must be on the same L2 segment as the announcing nodes ARP storms: High-traffic services can generate noisy ARP traffic No route aggregation: Each IP is independently announced via ARP Single point of failure: Only one node responds to ARP for a given IP I wanted a cleaner architecture: a dedicated Services VLAN for LoadBalancer IPs, advertised via BGP to my UDM Pro.\nThe Goal: BGP-Advertised LoadBalancer IPs on a Dedicated VLAN The target architecture:\nComponent Before After LoadBalancer IP range 10.90.3.200-210 (node subnet) 10.99.8.0/24 (Services VLAN) Advertisement method L2 (ARP) BGP Cluster ASN N/A 65010 Router ASN N/A 65001 (UDM Pro) With BGP, the cluster announces routes to my UDM Pro, which installs them in its routing table. 
Traffic to 10.99.8.x gets routed to whichever node is advertising that IP — no ARP required.\nHow BGP LoadBalancer Advertisement Works 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ┌─────────────────────────────┐ │ UDM Pro │ │ ASN 65001 │ │ 10.90.254.1 │ └──────────┬──────────────────┘ │ BGP Sessions (eBGP) │ ┌─────────────────────────┬──────────────────┼────────────────────┐ │ │ │ │ ▼ ▼ ▼ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ │ ASN 65010 │ │ ASN 65010 │ │ ASN 65010 │ │ │ 10.90.3.101 │ │ 10.90.3.102 │ │ 10.90.3.103 │ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │ │ │ │ │ Announces: │ Announces: │ Announces: │ │ 10.99.8.201 (envoy) │ 10.99.8.203 │ 10.99.8.205 │ │ 10.99.8.207 (dragonfly)│ (mosquitto) │ (qbittorrent) │ └────────────────────────┴──────────────────┴────────────────────┘ When client requests 10.99.8.201: 1. UDM Pro looks up route → next-hop is Node 1 2. Traffic routed directly to Node 1 3. Cilium delivers to envoy-gateway pod The key advantage: BGP routes are L3, so my LoadBalancer IPs can live on any subnet — they don\u0026rsquo;t need to be on the same broadcast domain as the nodes.\nThe Migration Step 1: Enable BGP Control Plane in Cilium First, enable the BGP control plane in Cilium\u0026rsquo;s Helm values:\n1 2 3 # kubernetes/apps/kube-system/cilium/app/helm-values.yaml bgpControlPlane: enabled: true This unlocks the new BGP CRDs but doesn\u0026rsquo;t configure any peering yet.\nStep 2: Create the BGP CRDs Cilium\u0026rsquo;s BGP implementation uses four CRDs:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 # kubernetes/apps/kube-system/cilium/app/networking.yaml --- # What to advertise apiVersion: cilium.io/v2 kind: CiliumBGPAdvertisement metadata: name: lb-services labels: advertise: bgp spec: advertisements: - 
advertisementType: \u0026#34;Service\u0026#34; service: addresses: - LoadBalancerIP # Advertise all LoadBalancer services unless explicitly excluded selector: matchExpressions: - key: io.cilium/bgp-announce operator: NotIn values: [\u0026#34;false\u0026#34;] --- # How to peer (timers, graceful restart) apiVersion: cilium.io/v2 kind: CiliumBGPPeerConfig metadata: name: udm-peer spec: timers: holdTimeSeconds: 90 keepAliveTimeSeconds: 30 gracefulRestart: enabled: true restartTimeSeconds: 120 families: - afi: ipv4 safi: unicast advertisements: matchLabels: advertise: bgp --- # Cluster-wide BGP configuration apiVersion: cilium.io/v2 kind: CiliumBGPClusterConfig metadata: name: bgp-cluster spec: nodeSelector: matchLabels: kubernetes.io/os: linux bgpInstances: - name: \u0026#34;home-cluster\u0026#34; localASN: 65010 peers: - name: \u0026#34;udm-pro\u0026#34; peerAddress: \u0026#34;10.90.254.1\u0026#34; peerASN: 65001 peerConfigRef: name: udm-peer --- # IP pool for LoadBalancer services apiVersion: cilium.io/v2 kind: CiliumLoadBalancerIPPool metadata: name: lb-pool spec: allowFirstLastIPs: \u0026#34;No\u0026#34; blocks: - cidr: \u0026#34;10.99.8.0/24\u0026#34; Step 3: Configure BGP on the UDM Pro On the UniFi side, I configured BGP in the Network settings:\nEnable BGP: Settings → Routing → BGP Local ASN: 65001 Add neighbor: 10.90.3.101-103 (each node), ASN 65010 Accept routes: Enable route acceptance from neighbors The UDM Pro now accepts route announcements from the cluster and installs them in its routing table.\nStep 4: Centralize LoadBalancer IP Variables Rather than scattering hardcoded IPs across HelmReleases, I centralized them in cluster-settings.yaml:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 # kubernetes/components/common/cluster-vars/cluster-settings.yaml apiVersion: v1 kind: ConfigMap metadata: name: cluster-settings data: # LoadBalancer IPs (Services VLAN - 10.99.8.0/24) ENVOY_EXTERNAL_LBIP: \u0026#34;10.99.8.201\u0026#34; ENVOY_INTERNAL_LBIP: 
\u0026#34;10.99.8.202\u0026#34; MOSQUITTO_LBIP: \u0026#34;10.99.8.203\u0026#34; SMTP_RELAY_LBIP: \u0026#34;10.99.8.204\u0026#34; QBITTORRENT_LBIP: \u0026#34;10.99.8.205\u0026#34; PLEX_LBIP: \u0026#34;10.99.8.206\u0026#34; DRAGONFLY_LBIP: \u0026#34;10.99.8.207\u0026#34; JELLYFIN_LBIP: \u0026#34;10.99.8.208\u0026#34; NETWORK_UPS_TOOLS_LBIP: \u0026#34;10.99.8.209\u0026#34; GRAPHITE_EXPORTER_LBIP: \u0026#34;10.99.8.210\u0026#34; POSTGRES17_LBIP: \u0026#34;10.99.8.211\u0026#34; Services reference these via Flux variable substitution:\n1 2 3 4 5 6 7 8 9 10 11 12 # Example: dragonfly service apiVersion: v1 kind: Service metadata: name: dragonfly-lb annotations: io.cilium/lb-ipam-ips: ${DRAGONFLY_LBIP} spec: type: LoadBalancer ports: - name: dragonfly port: 6379 Info Variable naming: Flux\u0026rsquo;s envsubst doesn\u0026rsquo;t allow hyphens in variable names. Use underscores: DRAGONFLY_LBIP not DRAGONFLY-LBIP.\nStep 5: Remove L2 Announcement Configuration With BGP working, I removed the old L2 configs:\nCiliumL2AnnouncementPolicy — deleted CiliumLoadBalancerIPPool for old subnet — replaced with new 10.99.8.0/24 pool Step 6: Update DNS The internal gateway moved from 10.90.3.202 to 10.99.8.202. I updated the DNS A record for internal.nerdz.cloud in my UDM Pro\u0026rsquo;s local DNS settings (managed via external-dns-unifi, but this specific record needed a manual kick).\nPost-Migration Cleanup After the BGP migration, some pods had stale network state — they\u0026rsquo;d cached the old gateway IP or had connection pools pointing to old addresses. The fix was simple: restart everything.\n1 2 3 4 # Restart all deployments in affected namespaces for ns in downloads entertainment home home-automation games observability; do kubectl rollout restart deployment -n $ns done I also discovered an interesting side effect: Cilium performs TCP health checks on LoadBalancer services. My mosquitto logs filled with:\n1 Client \u0026lt;unknown\u0026gt; closed its connection. 
These are Cilium\u0026rsquo;s BGP control plane verifying the service is reachable — completely normal, not an error.\nVerifying BGP Sessions To confirm BGP is working, exec into a Cilium pod and check peer status:\n1 2 # Check BGP peering status from each node kubectl exec -n kube-system ds/cilium -- cilium-dbg bgp peers Output shows all three nodes with established sessions:\n1 2 3 4 5 6 7 8 9 10 11 === stanton-01 === Local AS Peer AS Peer Address Session Uptime Family Received Advertised 65010 65001 10.90.254.1:179 established 47m11s ipv4/unicast 11 12 === stanton-02 === Local AS Peer AS Peer Address Session Uptime Family Received Advertised 65010 65001 10.90.254.1:179 established 47m55s ipv4/unicast 11 12 === stanton-03 === Local AS Peer AS Peer Address Session Uptime Family Received Advertised 65010 65001 10.90.254.1:179 established 48m11s ipv4/unicast 11 12 To see what routes the cluster is advertising:\n1 kubectl exec -n kube-system ds/cilium -- cilium-dbg bgp routes advertised ipv4 unicast peer 10.90.254.1 1 2 3 4 5 6 7 8 9 10 11 12 VRouter Prefix NextHop Age Attrs 65010 10.99.8.201/32 10.90.3.101 51m26s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.202/32 10.90.3.101 51m25s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.203/32 10.90.3.101 51m31s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.204/32 10.90.3.101 51m30s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.205/32 10.90.3.101 51m11s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.206/32 10.90.3.101 51m11s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.207/32 10.90.3.101 51m32s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.208/32 10.90.3.101 51m10s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.209/32 10.90.3.101 51m26s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] 65010 10.99.8.210/32 10.90.3.101 51m24s [{Origin: i} {AsPath: 65010} 
{Nexthop: 10.90.3.101}] 65010 10.99.8.211/32 10.90.3.101 51m32s [{Origin: i} {AsPath: 65010} {Nexthop: 10.90.3.101}] Verifying Routes on the UDM Pro The real proof is checking that the UDM Pro has actually installed these BGP routes. SSH into the UDM and check:\n1 ssh unifi \u0026#34;ip route show proto bgp\u0026#34; 1 2 3 4 5 6 7 8 9 10 11 12 13 10.99.8.201 metric 20 nexthop via 10.90.3.101 dev br0 weight 1 nexthop via 10.90.3.102 dev br0 weight 1 nexthop via 10.90.3.103 dev br0 weight 1 10.99.8.202 metric 20 nexthop via 10.90.3.101 dev br0 weight 1 nexthop via 10.90.3.102 dev br0 weight 1 nexthop via 10.90.3.103 dev br0 weight 1 10.99.8.203 metric 20 nexthop via 10.90.3.101 dev br0 weight 1 nexthop via 10.90.3.102 dev br0 weight 1 nexthop via 10.90.3.103 dev br0 weight 1 ... This output is the smoking gun:\nproto bgp — Routes were learned via BGP, not statically configured Multiple nexthops — ECMP (Equal-Cost Multi-Path) is active; each LoadBalancer IP has three paths (one per node) Equal weight — Traffic is load-balanced across all nodes The UDM Pro will distribute incoming traffic for any 10.99.8.x IP across all three cluster nodes. 
Cilium on each node then delivers the traffic to the correct pod.\nEnd-to-End Connectivity Test Finally, verify that services are actually reachable via their BGP-advertised IPs:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 # Test from inside the cluster kubectl run bgp-test --rm -it --restart=Never --image=busybox:1.36 -- \\ sh -c \u0026#34;nc -zv 10.99.8.207 6379 \u0026amp;\u0026amp; echo \u0026#39;Dragonfly reachable\u0026#39;\u0026#34; # Output: # 10.99.8.207 (10.99.8.207:6379) open # Dragonfly reachable kubectl run bgp-test --rm -it --restart=Never --image=busybox:1.36 -- \\ sh -c \u0026#34;nc -zv 10.99.8.211 5432 \u0026amp;\u0026amp; echo \u0026#39;Postgres reachable\u0026#39;\u0026#34; # Output: # 10.99.8.211 (10.99.8.211:5432) open # Postgres reachable Verification Summary Check Status Details BGP CRDs deployed ✅ All 4 CRDs present IP Pool ✅ 10.99.8.0/24 with 243 IPs available BGP session (stanton-01) ✅ Established, 12 routes advertised BGP session (stanton-02) ✅ Established, 12 routes advertised BGP session (stanton-03) ✅ Established, 12 routes advertised Routes in UDM ✅ 11 /32 routes with proto bgp ECMP enabled ✅ 3 nexthops per route L2 policies removed ✅ No CiliumL2AnnouncementPolicy found Service connectivity ✅ All LoadBalancer IPs reachable The Numbers Metric Before (L2) After (BGP) IP range 10.90.3.200-210 (shared subnet) 10.99.8.0/24 (dedicated VLAN) Advertisement method ARP BGP route announcements Routing resilience Single ARP responder Multiple BGP paths possible Configuration files Scattered hardcoded IPs Centralized in cluster-settings Subnet flexibility Must be on node L2 segment Any routable subnet Lessons Learned BGP is simpler than it looks — The four Cilium CRDs are straightforward once you understand the model: pool defines IPs, advertisement defines what to announce, peer config defines how, cluster config ties it together.\nCentralize your IPs — Having all LoadBalancer IPs in one ConfigMap makes changes easy and prevents the drift that comes from 
editing 15 different HelmReleases.\nRestart pods after network changes — Pods cache DNS and connection state. After changing IPs or network paths, restart affected workloads to pick up fresh state.\nDedicated subnets are worth it — Moving LoadBalancer IPs to their own VLAN provides clean separation and makes firewall rules simpler.\nBGP health checks are noisy — If you see mysterious connection attempts to your LoadBalancer services, it\u0026rsquo;s probably Cilium verifying reachability. Check your BGP config before debugging application issues.\nThe Gotcha: Flux substituteFrom One issue that bit me: the envoy-gateway Gateways showed null for their addresses. The problem was that the envoy-gateway-config Kustomization was missing postBuild.substituteFrom:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 # kubernetes/apps/network/envoy-gateway/ks.yaml apiVersion: kustomize.toolkit.fluxcd.io/v1 kind: Kustomization metadata: name: envoy-gateway-config namespace: network spec: # ... other config ... postBuild: substituteFrom: - kind: ConfigMap name: cluster-settings - kind: Secret name: cluster-secrets Without this, Flux doesn\u0026rsquo;t substitute variables like ${ENVOY_INTERNAL_LBIP}, and they render as literal null in the output. 
The parent Kustomization patches handle this for most resources, but envoy-gateway-config needed it explicitly because it\u0026rsquo;s a separate Kustomization with its own path.\nReferences Cilium BGP Control Plane — Official documentation CiliumBGPClusterConfig CRD — New v2 BGP API UniFi BGP Configuration — Setting up BGP on UDM Pro Update: DSR Hairpin Limitation After this migration, I discovered that while BGP fixes L2-level hairpin routing (where traffic couldn\u0026rsquo;t reach an IP \u0026ldquo;owned\u0026rdquo; by a different node via ARP), it doesn\u0026rsquo;t solve all hairpin scenarios.\nCilium\u0026rsquo;s DSR (Direct Server Return) mode has a same-node hairpin limitation: when a pod tries to reach a LoadBalancer VIP and the backend for that VIP is on the same node, traffic fails. The packet goes out to the router, comes back, and gets dropped.\nThe symptom: Pods intermittently timing out when reaching internal services — but only when the pod and the service backend happen to be scheduled on the same node.\nThe solution: CoreDNS rewriting to return ClusterIP instead of LoadBalancer VIP for pod-to-service traffic.\nRead the full story: When BGP Doesn\u0026rsquo;t Fix Hairpin: Cilium DSR and the Same-Node Problem\nThis post documents the BGP migration I performed on my home-ops repository. 
The centralized LBIP pattern and BGP configuration were refined through several iterations with help from Claude Code.\n","date":"2025-12-13T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/cilium-bgp-migration/","title":"From L2 Announcements to BGP: Migrating Cilium LoadBalancer IPs"},{"content":" \u0026ldquo;Your IDE\u0026rsquo;s schema validation is only as reliable as the endpoint serving it.\u0026rdquo;\nThe Problem: Schema Sprawl When I audited the yaml-language-server schema references across my home-ops repository, I found chaos:\n1 2 3 4 5 6 7 8 9 $ grep -rh \u0026#34;yaml-language-server.*\\$schema=\u0026#34; --include=\u0026#34;*.yaml\u0026#34; | \\ sed \u0026#39;s/.*\\$schema=//\u0026#39; | sort | uniq -c | sort -rn | head -10 148 https://json.schemastore.org/kustomization 79 https://kubernetes-schemas.pages.dev/... 49 https://kubernetes-schemas.pages.dev/... 21 https://lds-schemas.pages.dev/... 7 https://kubernetes-schemas.ok8.sh/... 5 https://kube-schemas.pages.dev/... 2 https://cluster-schemas.pages.dev/... 678 YAML files with schemas, pulling from six different sources. All external. 
All outside my control.\nThe problems:\nReliability: If any of these pages.dev sites go down, my IDE validation breaks Consistency: Different sources have slightly different schema versions Staleness: External schemas might not match my actual cluster CRDs Version drift: After upgrading a CRD, schemas lag until someone updates them The solution: extract schemas directly from my cluster and host them myself.\nThe Architecture 1 2 3 4 5 6 7 8 9 ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Talos Cluster │────▶│ GitHub Actions │────▶│ Cloudflare │ │ CRDs │ │ Runner (self- │ │ Pages │ │ │ │ hosted) │ │ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │ │ ▼ ▼ ▼ kubectl get crds crd-extractor.sh kubernetes-schemas (from cluster) (JSON conversion) .nerdz.cloud The workflow:\nSelf-hosted runner has kubectl access to the cluster Daily cron job extracts all CRDs and converts to JSON Schema Schemas deploy to Cloudflare Pages at kubernetes-schemas.nerdz.cloud All YAML files reference the self-hosted endpoint Step 1: GitHub App for Actions Runner Controller The self-hosted runner needs to authenticate to GitHub. Following the onedr0p pattern, I created a GitHub App rather than using a PAT.\nCreating the App Go to GitHub Settings → Developer settings → GitHub Apps → New GitHub App\nConfigure:\nName: Nerdz-Action Runner Homepage URL: Your repo URL Webhook: Disable (unchecked) Permissions: Repository: Administration (Read and write), Metadata (Read-only) Where can this app be installed?: Only on this account After creation, note the App ID\nGenerate a Private Key (downloads a .pem file)\nInstall the app on your repository and note the Installation ID (from the URL)\nStoring Credentials in 1Password The private key is multi-line PEM, which 1Password doesn\u0026rsquo;t handle well in text fields. 
The workaround: base64 encode it.\n1 2 # Windows PowerShell [Convert]::ToBase64String([IO.File]::ReadAllBytes(\u0026#34;path\\to\\private-key.pem\u0026#34;)) Store in 1Password under a github-bots item:\nACTIONS_RUNNER_APP_ID: The App ID ACTIONS_RUNNER_INSTALLATION_ID: The Installation ID ACTIONS_RUNNER_PRIVATE_KEY: Base64-encoded private key Step 2: Deploy Actions Runner Controller The controller manages ephemeral runner pods that scale based on workflow demand.\nThe Controller HelmRelease 1 2 3 4 5 6 7 8 9 10 11 12 13 # kubernetes/apps/actions-runner-system/actions-runner-controller/app/helmrelease.yaml --- apiVersion: helm.toolkit.fluxcd.io/v2 kind: HelmRelease metadata: name: actions-runner-controller spec: chartRef: kind: OCIRepository name: gha-runner-scale-set-controller interval: 1h values: replicaCount: 1 The Runner Scale Set Each repository gets its own runner scale set. The key insight: the runner needs kubectl access to extract CRDs, so it gets a ServiceAccount with cluster-admin.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 # kubernetes/apps/actions-runner-system/actions-runner-controller/runners/home-ops/helmrelease.yaml --- apiVersion: helm.toolkit.fluxcd.io/v2 kind: HelmRelease metadata: name: \u0026amp;name home-ops-runner spec: chartRef: kind: OCIRepository name: gha-runner-scale-set values: githubConfigUrl: https://github.com/gavinmcfall/home-ops githubConfigSecret: home-ops-runner-secret minRunners: 1 maxRunners: 3 containerMode: type: kubernetes kubernetesModeWorkVolumeClaim: accessModes: [ReadWriteOnce] storageClassName: openebs-hostpath resources: requests: storage: 25Gi controllerServiceAccount: name: actions-runner-controller namespace: actions-runner-system template: spec: containers: - name: runner image: ghcr.io/home-operations/actions-runner:2.330.0 command: [/home/runner/run.sh] env: - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER value: \u0026#34;false\u0026#34; serviceAccountName: 
*name ExternalSecret with Base64 Decode The private key needs decoding from base64:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 # kubernetes/apps/actions-runner-system/actions-runner-controller/runners/home-ops/externalsecret.yaml --- apiVersion: external-secrets.io/v1 kind: ExternalSecret metadata: name: home-ops-runner spec: secretStoreRef: kind: ClusterSecretStore name: onepassword-connect target: name: home-ops-runner-secret template: data: github_app_id: \u0026#34;{{ .ACTIONS_RUNNER_APP_ID }}\u0026#34; github_app_installation_id: \u0026#34;{{ .ACTIONS_RUNNER_INSTALLATION_ID }}\u0026#34; github_app_private_key: \u0026#34;{{ .ACTIONS_RUNNER_PRIVATE_KEY | b64dec }}\u0026#34; dataFrom: - extract: key: github-bots The | b64dec template function handles the base64 decoding.\nRBAC for kubectl Access 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 # kubernetes/apps/actions-runner-system/actions-runner-controller/runners/home-ops/rbac.yaml --- apiVersion: v1 kind: ServiceAccount metadata: name: home-ops-runner --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: home-ops-runner roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: cluster-admin subjects: - kind: ServiceAccount name: home-ops-runner namespace: actions-runner-system Step 3: The Schemas Workflow The workflow runs daily, extracts CRDs, and deploys to Cloudflare Pages.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 # .github/workflows/schemas.yaml --- name: Schemas on: workflow_dispatch: schedule: - cron: 0 0 * * * push: branches: [main] paths: - .github/workflows/schemas.yaml - .github/schemas-index.html jobs: main: name: Schemas runs-on: home-ops-runner steps: - name: Checkout uses: actions/checkout@v6 - name: Install kubectl uses: azure/setup-kubectl@v4 - name: Setup Python uses: actions/setup-python@v6 with: python-version: 3.14.x - name: Install Python 
Dependencies run: pip install pyyaml - name: Run crd-extractor run: | curl -fsSL https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/Utilities/crd-extractor.sh | bash - name: Generate index.html run: | cd /home/runner/.datree/crdSchemas # ... generate browsable index ... - name: Publish Schemas uses: cloudflare/wrangler-action@v3 with: apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }} accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }} workingDirectory: /home/runner/.datree/crdSchemas command: pages deploy --project-name=kubernetes-schemas --branch main . The datreeio/CRDs-catalog crd-extractor script handles the heavy lifting:\nRuns kubectl get crds -o yaml Converts OpenAPI v3 schemas to JSON Schema format Organizes by API group (helm.toolkit.fluxcd.io/, external-secrets.io/, etc.) Step 4: Cloudflare Pages Setup Cloudflare now steers new projects toward Workers, but Pages still works fine for static hosting.\nCreate a Pages project named kubernetes-schemas Add custom domain kubernetes-schemas.nerdz.cloud Add GitHub secrets: CLOUDFLARE_API_TOKEN (with Pages:Edit permission) CLOUDFLARE_ACCOUNT_ID The first workflow run populates the site. After that, it updates daily.\nThe Index Page I added a styled index.html that makes the schemas browsable:\nkubernetes-schemas.nerdz.cloud/ ├── index.html # Searchable UI ├── helm.toolkit.fluxcd.io/ │ ├── helmrelease_v2.json │ └── helmrelease_v2beta2.json ├── external-secrets.io/ │ ├── externalsecret_v1.json │ └── clustersecretstore_v1.json └── ... 
34 API groups total The UI shows stats (API groups, schema count, last update) and lets you search/filter.\nStep 5: Migrate All YAML Files With schemas hosted, I migrated 357 files:\n1 2 3 4 5 6 7 8 # Replace all external schema sources find kubernetes/ -name \u0026#34;*.yaml\u0026#34; -exec sed -i \\ \u0026#39;s|https://kubernetes-schemas.pages.dev/|https://kubernetes-schemas.nerdz.cloud/|g\u0026#39; {} + find kubernetes/ -name \u0026#34;*.yaml\u0026#34; -exec sed -i \\ \u0026#39;s|https://lds-schemas.pages.dev/|https://kubernetes-schemas.nerdz.cloud/|g\u0026#39; {} + # ... repeat for ok8.sh, kube-schemas.pages.dev, cluster-schemas.pages.dev I also added schemas to ~75 files that were missing them entirely.\nSchema Version Mismatches After migration, my IDE showed warnings on many files. The cause: schema URLs didn\u0026rsquo;t match apiVersions.\n1 2 3 # Wrong - schema says v1beta1 but apiVersion is v1 # yaml-language-server: $schema=https://kubernetes-schemas.nerdz.cloud/external-secrets.io/externalsecret_v1beta1.json apiVersion: external-secrets.io/v1 Fixed 60 files with version mismatches:\nexternalsecret_v1beta1 → externalsecret_v1 (51 files) clustersecretstore_v1beta1 → clustersecretstore_v1 (1 file) helmrepository_v1beta2 → helmrepository_v1 (8 files) What Didn\u0026rsquo;t Work: Flux Variable Patterns One schema validation error I couldn\u0026rsquo;t fix cleanly:\n1 2 HTTPRoute unifi is invalid: at \u0026#39;/spec/hostnames/0\u0026#39;: \u0026#39;unifi.${SECRET_DOMAIN}\u0026#39; does not match pattern \u0026#39;^(\\*\\.)?[a-z0-9]...\u0026#39; The Flux variable ${SECRET_DOMAIN} gets substituted at reconciliation time, but the schema validator sees the literal string and fails the hostname pattern.\nOptions considered:\nPatch schemas to allow Flux patterns - Over-permissive, masks real errors Accept the warnings - Harmless, Flux still works Remove schemas from files with variables - Loses validation I went with option 2. 
The warnings are cosmetic—kubeconform passes, Flux reconciles correctly, and the IDE just shows a squiggle on variable-heavy files.\nThe End Result Before:\n6 different external schema sources No control over availability or freshness Schema versions lagging behind CRD upgrades After:\nSingle self-hosted endpoint: kubernetes-schemas.nerdz.cloud Schemas extracted daily from actual cluster CRDs Browsable index with search Full control over the infrastructure $ curl -sI https://kubernetes-schemas.nerdz.cloud/helm.toolkit.fluxcd.io/helmrelease_v2.json HTTP/2 200 content-type: application/json The schemas refresh on the next daily run after I upgrade CRDs. No more waiting for upstream schema repos to catch up.\nCosts Cloudflare Pages: Free tier GitHub Actions: Free for public repos (self-hosted runner avoids minute limits anyway) Complexity: One more thing to maintain, but it\u0026rsquo;s fully GitOps-managed Summary Component Purpose GitHub App Authentication for self-hosted runner Actions Runner Controller Manages ephemeral runner pods home-ops-runner Runner scale set with kubectl access crd-extractor.sh Converts CRDs to JSON Schema Cloudflare Pages Hosts schemas at custom domain schemas-index.html Browsable UI for schema discovery The self-hosted runner pattern enables more than just schemas—any workflow that needs cluster access (integration tests, deployments, monitoring) can now run on infrastructure I control.\nThis post documents work on my home-ops repository. 
The runner pattern is adapted from onedr0p\u0026rsquo;s home-ops and the broader Kubernetes@Home community.\n","date":"2025-12-13T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/self-hosted-kubernetes-schemas/","title":"Self-Hosting Kubernetes CRD Schemas"},{"content":" \u0026ldquo;Your 100MB of data living in a 250MB file is not a feature.\u0026rdquo;\nThe Alert Got this from Alertmanager:\netcd cluster \u0026#34;kube-etcd\u0026#34;: database size in use on instance 10.90.3.101:2381 is 46.35% of the actual allocated disk space, please run defragmentation (e.g. etcdctl defrag) to retrieve the unused fragmented disk space. Time to learn what etcd fragmentation actually means.\nWhy Does etcd Fragment? etcd stores all Kubernetes state—every pod, deployment, secret, and configmap lives here. Under the hood, it uses a B+ tree data structure backed by an append-only write-ahead log (WAL).\nHere\u0026rsquo;s the key insight: when you delete or update a key in etcd, the old data isn\u0026rsquo;t removed from disk. It\u0026rsquo;s just marked as free space. The database file grows but never shrinks on its own.\nThree things constantly churn etcd in a Kubernetes cluster:\nPod scheduling: Every pod creation, update, and deletion writes to etcd Controller loops: Controllers constantly reconciling state means constant writes Lease renewals: Kubelet heartbeats, leader elections, and endpoint updates The Kubernetes API server runs automatic compaction, which removes old revisions of keys (you don\u0026rsquo;t need 1000 historical versions of a ConfigMap). But compaction just marks space as reusable—it doesn\u0026rsquo;t actually free it.\nOver time, your database file becomes Swiss cheese: actual data scattered among holes of freed space. 
This is fragmentation.\nChecking the Damage In Talos, checking etcd status is straightforward:\n1 2 3 4 5 6 $ talosctl -n 10.90.3.101,10.90.3.102,10.90.3.103 etcd status NODE MEMBER DB SIZE IN USE LEADER 10.90.3.101 4b0c33136dd72672 259 MB 119 MB (45.93%) f0f9525a77920d83 10.90.3.102 f0f9525a77920d83 262 MB 119 MB (45.28%) f0f9525a77920d83 10.90.3.103 cd15b67489885c40 267 MB 119 MB (44.49%) f0f9525a77920d83 All three control plane nodes were using less than 50% of their allocated space. The rest? Fragmented free space doing nothing but wasting disk I/O.\nThe Fix Defragmentation rewrites the database file compactly, eliminating the holes. In Talos, it\u0026rsquo;s a single command.\nStep 1: Snapshot First Paranoia is healthy when touching cluster state:\n1 talosctl -n 10.90.3.101 etcd snapshot /tmp/etcd-backup-$(date +%Y%m%d).snapshot This creates a consistent backup you can restore from if something goes wrong.\nStep 2: Defrag Each Node Sequentially Defrag briefly blocks reads and writes on that node, so you want to do one at a time. Best practice: non-leader nodes first, leader last.\nTo find the leader, look at the LEADER column in the status output. It shows the member ID of the current leader (f0f9525a77920d83). Then match that to the MEMBER column to find which node it is:\n1 2 3 4 5 6 $ talosctl -n 10.90.3.101,10.90.3.102,10.90.3.103 etcd status NODE MEMBER DB SIZE IN USE LEADER 10.90.3.101 4b0c33136dd72672 259 MB 119 MB (45.93%) f0f9525a77920d83 10.90.3.102 f0f9525a77920d83 262 MB 119 MB (45.28%) f0f9525a77920d83 \u0026lt;-- MEMBER matches LEADER 10.90.3.103 cd15b67489885c40 267 MB 119 MB (44.49%) f0f9525a77920d83 Node 10.90.3.102 has member ID f0f9525a77920d83, which matches the leader ID. 
So that\u0026rsquo;s our leader.\nNow defrag:\n1 2 3 4 5 6 # Non-leaders first talosctl -n 10.90.3.101 etcd defrag talosctl -n 10.90.3.103 etcd defrag # Leader last talosctl -n 10.90.3.102 etcd defrag Each defrag takes just a few seconds for a database this size.\nStep 3: Verify 1 2 3 4 5 6 $ talosctl -n 10.90.3.101,10.90.3.102,10.90.3.103 etcd status NODE DB SIZE IN USE LEADER 10.90.3.101 104 MB 104 MB (100.00%) f0f9525a77920d83 10.90.3.102 104 MB 104 MB (100.00%) f0f9525a77920d83 10.90.3.103 104 MB 104 MB (100.00%) f0f9525a77920d83 ~160MB reclaimed per node. 100% utilization means zero fragmentation.\nWhen to Defrag The general guidance:\nBelow 50% utilization: Defrag recommended NOSPACE errors: Defrag required (etcd will refuse writes when it hits its quota) For homelabs, waiting for the Prometheus alert is fine. Production clusters might schedule it monthly or watch the metric more closely.\nWhat If You Hit NOSPACE? If etcd hits its space quota before you defrag, it enters a read-only mode to protect data integrity. You\u0026rsquo;ll need to:\nDefrag to free space Clear the alarm: talosctl etcd alarm disarm If you\u0026rsquo;re consistently hitting the quota, you can increase it in your Talos machine config:\n1 2 3 4 cluster: etcd: extraArgs: quota-backend-bytes: \u0026#34;4294967296\u0026#34; # 4 GiB References Talos etcd Maintenance etcd Maintenance Documentation ","date":"2025-12-11T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/etcd-defragmentation/","title":"Defragmenting etcd in a Talos Kubernetes Cluster"},{"content":" \u0026ldquo;The best backup strategy is the one you can actually verify and restore from.\u0026rdquo;\nWhy Move from Restic to Kopia? My original Volsync setup used Restic to back up PVCs directly to Backblaze B2. It worked, but had some pain points:\nNo visibility: Restic repositories are opaque. You can\u0026rsquo;t browse them without CLI tools. 
Slow restores: Every restore required downloading from S3, which is slow and costs egress fees. No deduplication across apps: Each app had its own Restic repository with no shared deduplication. Kopia solves all of these:\nWeb UI: Kopia has a built-in web interface to browse snapshots, verify integrity, and trigger restores. Local NFS repository: Backups go to NFS first (fast restores), then sync to cloud storage. Global deduplication: A single Kopia repository deduplicates across all PVCs. The pattern I\u0026rsquo;m following comes from the home-operations community—specifically Devin (onedr0p), Jory (joryirving), and Kashall\u0026rsquo;s homelab repos.\nArchitecture Overview 1 2 3 4 5 6 7 8 9 10 11 ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ App PVC │────▶│ Volsync │────▶│ Kopia Server │ │ (source) │ │ ReplicationSrc │ │ (NFS backend) │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ ▼ ┌─────────────────┐ │ NFS Share │ │ citadel:/mnt/ │ │ VolsyncKopia │ └─────────────────┘ The key insight: Volsync mover pods need access to the NFS share where Kopia stores its repository. Instead of configuring NFS mounts in every ReplicationSource, we use MutatingAdmissionPolicy to automatically inject the NFS volume into any pod with specific labels.\nPrerequisites Before starting, I needed:\nNFS share on my NAS (citadel.internal) at /mnt/storage0/backups/VolsyncKopia 1Password item named kopia with a KOPIA_PASSWORD field Kubernetes 1.33+ for MutatingAdmissionPolicy support Step 1: Enable MutatingAdmissionPolicy Feature Gate MutatingAdmissionPolicy is an alpha feature in Kubernetes 1.33. 
To enable it on Talos, I added a controller patch:\n1 2 3 4 5 6 # kubernetes/bootstrap/talos/patches/controller/feature-gates.yaml cluster: apiServer: extraArgs: feature-gates: ImageVolume=true,MutatingAdmissionPolicy=true runtime-config: admissionregistration.k8s.io/v1alpha1=true The v1beta1 vs v1alpha1 Gotcha My first attempt used v1beta1 because that\u0026rsquo;s what the documentation suggested. Wrong. Kubernetes 1.33 only supports v1alpha1—v1beta1 arrives in Kubernetes 1.34.\nAfter applying the feature gate and seeing the API still wasn\u0026rsquo;t available, I had to:\nChange runtime-config from v1beta1 to v1alpha1 Update all MutatingAdmissionPolicy manifests from v1beta1 to v1alpha1 Reapply the Talos config to all three nodes Lesson learned: Always verify API versions before implementing:\n1 2 kubectl api-resources --api-group=admissionregistration.k8s.io kubectl api-versions | grep admission Rolling Out Talos Changes Safely I applied the changes one node at a time to minimize risk:\n1 2 3 4 5 6 7 8 9 10 # Dry-run first talosctl apply-config -n 10.90.3.101 \\ -f clusterconfig/home-kubernetes-stanton-01.yaml --dry-run # Apply to first node, wait for Ready talosctl apply-config -n 10.90.3.101 \\ -f clusterconfig/home-kubernetes-stanton-01.yaml kubectl get nodes -w # Repeat for remaining nodes Step 2: Create the MutatingAdmissionPolicy The policy automatically injects an NFS volume into Volsync mover pods:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 # kubernetes/apps/storage/volsync/app/mutatingadmissionpolicy.yaml --- apiVersion: admissionregistration.k8s.io/v1alpha1 kind: MutatingAdmissionPolicy metadata: name: volsync-mover-nfs spec: failurePolicy: Fail matchConstraints: resourceRules: - apiGroups: [\u0026#34;\u0026#34;] apiVersions: [\u0026#34;v1\u0026#34;] resources: [\u0026#34;pods\u0026#34;] operations: [\u0026#34;CREATE\u0026#34;] matchConditions: - name: 
is-volsync-mover expression: \u0026#34;has(object.metadata.labels) \u0026amp;\u0026amp; \u0026#39;volsync.backube/mover\u0026#39; in object.metadata.labels\u0026#34; mutations: - patchType: ApplyConfiguration applyConfiguration: expression: | Object{ spec: Object.spec{ volumes: [ Object{ name: \u0026#34;kopia-repository\u0026#34;, nfs: Object{ server: \u0026#34;citadel.internal\u0026#34;, path: \u0026#34;/mnt/storage0/backups/VolsyncKopia\u0026#34; } } ], containers: object.spec.containers.map(c, Object{ name: c.name, volumeMounts: [ Object{ name: \u0026#34;kopia-repository\u0026#34;, mountPath: \u0026#34;/repository\u0026#34; } ] } ) } } This policy:\nMatches any pod with the label volsync.backube/mover Injects an NFS volume pointing to the Kopia repository Mounts it at /repository in all containers I also added a jitter policy to prevent all backup jobs from running simultaneously.\nStep 3: Deploy the Kopia Server The Kopia server provides a Web UI for browsing and managing backups:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 # kubernetes/apps/storage/kopia/app/helmrelease.yaml --- apiVersion: helm.toolkit.fluxcd.io/v2 kind: HelmRelease metadata: name: kopia spec: chartRef: kind: OCIRepository name: kopia values: controllers: kopia: containers: app: image: repository: ghcr.io/home-operations/kopia tag: 0.22.3@sha256:eeebd12fd4b3a9c25b9f711fff32454f62e2d5e2d431ab6806ad21c52f414807 env: KOPIA_WEB_ENABLED: true KOPIA_WEB_PORT: \u0026amp;port 80 TZ: ${TIMEZONE} envFrom: - secretRef: name: kopia-secret command: - /bin/sh - -c - | export HOME=/tmp export USER=kopia # Initialize repository if it doesn\u0026#39;t exist if ! 
kopia repository connect filesystem --path=/repository 2\u0026gt;/dev/null; then echo \u0026#34;Initializing new Kopia repository...\u0026#34; kopia repository create filesystem --path=/repository fi # Start the server (disable CSRF for reverse proxy compatibility) exec kopia server start --address=0.0.0.0:80 --without-password --insecure --disable-csrf-token-checks defaultPodOptions: securityContext: runAsNonRoot: true runAsUser: 568 runAsGroup: 568 fsGroup: 568 fsGroupChangePolicy: OnRootMismatch route: app: annotations: internal-dns.alpha.kubernetes.io/target: internal.${SECRET_DOMAIN} hostnames: - \u0026#34;{{ .Release.Name }}.${SECRET_DOMAIN}\u0026#34; parentRefs: - name: internal namespace: network sectionName: https persistence: config-file: type: configMap identifier: config globalMounts: - path: /config/repository.config subPath: repository.config repository: type: nfs server: citadel.internal path: /mnt/storage0/backups/VolsyncKopia globalMounts: - path: /repository Mistakes I Made 1. Repository initialization: My first deployment crashed because the NFS path was empty—no Kopia repository existed. The startup script now auto-initializes if needed.\n2. KOPIA_PASSWORD handling: I initially tried passing --password as a flag, which expects interactive input. The fix: rely on the KOPIA_PASSWORD environment variable being read automatically.\n3. HOME and USER environment variables: The non-root container couldn\u0026rsquo;t determine the current user. Adding export HOME=/tmp and export USER=kopia fixed the permission errors.\n4. ConfigMap naming: app-template creates ConfigMaps using the release name (kopia), not a custom suffix. I had to change from name: kopia-config to identifier: config to reference the chart-defined ConfigMap correctly.\n5. CSRF token errors behind reverse proxy: When accessing the Kopia Web UI through Gateway API, I got \u0026ldquo;invalid CSRF token\u0026rdquo; errors flooding the logs. 
The fix: add --disable-csrf-token-checks to the server start command. This is safe for internal services behind a reverse proxy.\n6. Gateway naming conventions: My first attempt used envoy-internal as the gateway name. Wrong: the gateways are just named internal and external in the network namespace. I also forgot the sectionName: https.\n7. Missing DNS annotation: Routes need internal-dns.alpha.kubernetes.io/target: internal.${SECRET_DOMAIN} for internal DNS to create records. Without this, the hostname doesn\u0026rsquo;t resolve.\n8. Hardcoded values: Used Pacific/Auckland instead of ${TIMEZONE} and kopia.${SECRET_DOMAIN} instead of \u0026quot;{{ .Release.Name }}.${SECRET_DOMAIN}\u0026quot;. These should use variables for consistency.\n9. Wrong UID/GID: Initially used 1000 for the security context. The standard in my cluster is 568 for the apps user/group. This matters for NFS share permissions.\nStep 4: Create the Volsync Components To avoid repeating the same configuration across every app, I created reusable Kustomize components. But I went further than just local NFS, because I wanted a proper 3-2-1 backup strategy:\n3 copies of data (local PVC + NFS + cloud) 2 different storage types (Ceph block + NFS + S3) 1 offsite copy (cloud) The Multi-Destination Architecture kubernetes/components/volsync/ ├── kustomization.yaml # Includes all 3 (most common) ├── nfs-truenas/ # Local NFS - hourly, with restore │ ├── externalsecret.yaml │ ├── kustomization.yaml │ ├── pvc.yaml │ ├── replicationdestination.yaml │ └── replicationsource.yaml ├── s3-backblaze/ # Backblaze B2 - daily DR │ ├── externalsecret.yaml │ ├── kustomization.yaml │ └── replicationsource.yaml └── s3-cloudflare/ # Cloudflare R2 - daily DR ├── externalsecret.yaml ├── kustomization.yaml └── replicationsource.yaml Key insight: The cloud components (B2/R2) don\u0026rsquo;t need PVC or ReplicationDestination. Restores happen from local NFS first (faster). 
Cloud backups are for disaster recovery only.\nThe Root Component The root kustomization.yaml includes all three destinations:\n1 2 3 4 5 6 7 8 # kubernetes/components/volsync/kustomization.yaml --- apiVersion: kustomize.config.k8s.io/v1alpha1 kind: Component components: - ./nfs-truenas - ./s3-backblaze - ./s3-cloudflare This means most apps just need:\n1 2 components: - ../../../../components/volsync # Gets all 3 destinations If you only want specific destinations, reference them directly:\n1 2 components: - ../../../../components/volsync/nfs-truenas # Just local NFS The NFS Component (Primary) The NFS component has the PVC and ReplicationDestination for restores:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 # kubernetes/components/volsync/nfs-truenas/externalsecret.yaml --- apiVersion: external-secrets.io/v1 kind: ExternalSecret metadata: name: \u0026#34;${APP}-volsync\u0026#34; spec: secretStoreRef: kind: ClusterSecretStore name: onepassword-connect target: name: \u0026#34;${APP}-volsync-secret\u0026#34; template: data: KOPIA_FS_PATH: /repository KOPIA_PASSWORD: \u0026#34;{{ .KOPIA_PASSWORD }}\u0026#34; KOPIA_REPOSITORY: filesystem:///repository dataFrom: - extract: key: kopia The ReplicationSource backs up hourly:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # kubernetes/components/volsync/nfs-truenas/replicationsource.yaml --- apiVersion: volsync.backube/v1alpha1 kind: ReplicationSource metadata: name: ${APP} spec: sourcePVC: ${VOLSYNC_CLAIM:=${APP}} trigger: schedule: \u0026#34;0 * * * *\u0026#34; # Hourly kopia: compression: zstd-fastest copyMethod: Snapshot moverSecurityContext: runAsUser: ${VOLSYNC_UID:=568} runAsGroup: ${VOLSYNC_GID:=568} fsGroup: ${VOLSYNC_GID:=568} repository: ${APP}-volsync-secret retain: hourly: 24 daily: 7 The Cloud Components (Disaster Recovery) The Backblaze B2 component uses Kopia\u0026rsquo;s S3-compatible backend:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 # 
kubernetes/components/volsync/s3-backblaze/externalsecret.yaml --- apiVersion: external-secrets.io/v1 kind: ExternalSecret metadata: name: \u0026#34;${APP}-volsync-b2\u0026#34; spec: secretStoreRef: kind: ClusterSecretStore name: onepassword-connect target: name: \u0026#34;${APP}-volsync-b2-secret\u0026#34; template: data: KOPIA_PASSWORD: \u0026#34;{{ .KOPIA_PASSWORD }}\u0026#34; KOPIA_S3_BUCKET: \u0026#34;{{ .VOLSYNC_KOPIA_B2_BUCKET }}\u0026#34; KOPIA_S3_ENDPOINT: \u0026#34;s3.us-east-005.backblazeb2.com\u0026#34; AWS_ACCESS_KEY_ID: \u0026#34;{{ .VOLSYNC_KOPIA_B2_ACCESS_KEY }}\u0026#34; AWS_SECRET_ACCESS_KEY: \u0026#34;{{ .VOLSYNC_KOPIA_B2_SECRET_ACCESS_KEY }}\u0026#34; KOPIA_REPOSITORY: \u0026#34;s3://{{ .VOLSYNC_KOPIA_B2_BUCKET }}/${APP}/\u0026#34; dataFrom: - extract: key: kopia - extract: key: backblaze The ReplicationSource backs up daily and keeps 14 days:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 # kubernetes/components/volsync/s3-backblaze/replicationsource.yaml --- apiVersion: volsync.backube/v1alpha1 kind: ReplicationSource metadata: name: ${APP}-b2 spec: sourcePVC: ${VOLSYNC_CLAIM:=${APP}} trigger: schedule: \u0026#34;0 0 * * *\u0026#34; # Daily at midnight kopia: compression: zstd-fastest copyMethod: Snapshot moverSecurityContext: runAsUser: ${VOLSYNC_UID:=568} runAsGroup: ${VOLSYNC_GID:=568} fsGroup: ${VOLSYNC_GID:=568} repository: ${APP}-volsync-b2-secret retain: daily: 14 Cloudflare R2 follows the same pattern, with the endpoint constructed from the account ID:\n1 KOPIA_S3_ENDPOINT: \u0026#34;{{ .CLOUDFLARE_ACCOUNT_ID }}.r2.cloudflarestorage.com\u0026#34; Why Kopia for Cloud Too? You might wonder why I didn\u0026rsquo;t stick with Restic for cloud backups. The answer: Restic lock issues. Restic repositories can get stuck with stale locks, requiring manual intervention with restic unlock. 
Kopia handles concurrent access better and doesn\u0026rsquo;t have this problem.\nThe perfectra1n fork of Volsync (ghcr.io/perfectra1n/volsync) supports Kopia\u0026rsquo;s S3 backend via environment variables:\nKOPIA_S3_BUCKET - bucket name KOPIA_S3_ENDPOINT - S3-compatible endpoint AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - credentials KOPIA_REPOSITORY - full repository URL (s3://bucket/path/) Another Gotcha: ClusterSecretStore Name I assumed the ClusterSecretStore was named onepassword. It\u0026rsquo;s actually onepassword-connect. Always verify existing resource names:\n1 kubectl get clustersecretstore Step 5: Migrate an App (The Hard Way) Migrating an existing app with data turned out to be more complex than expected. The Kopia volsync component expects PVCs named ${APP} (e.g., romm), but my existing app used romm-data. Here\u0026rsquo;s the approach that worked:\nThe Problem: PVC Name Mismatch My romm app used a PVC named romm-data, but the volsync component creates resources expecting PVC name ${APP} (romm). I tried several approaches that failed:\nUsing VOLSYNC_CLAIM variable - The component\u0026rsquo;s PVC template still created a conflicting romm PVC Patching the dataSourceRef - PVC specs are immutable after creation Snapshotting a terminating PVC - Can\u0026rsquo;t add finalizers to a PVC marked for deletion The Approach That Worked Step 1: Keep existing backups running\nDon\u0026rsquo;t switch to Kopia immediately. 
Keep the old Restic-based volsync template running so you have cloud backups.\nStep 2: Rename the PVC via snapshot\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 # Scale down the app flux suspend kustomization romm -n games kubectl scale deploy romm -n games --replicas=0 # Create a snapshot of the existing PVC kubectl apply -f - \u0026lt;\u0026lt;EOF apiVersion: snapshot.storage.k8s.io/v1 kind: VolumeSnapshot metadata: name: romm-data-migration namespace: games spec: volumeSnapshotClassName: csi-ceph-block source: persistentVolumeClaimName: romm-data EOF # Wait for snapshot to be ready kubectl get volumesnapshot romm-data-migration -n games # Create new PVC with correct name from snapshot kubectl apply -f - \u0026lt;\u0026lt;EOF apiVersion: v1 kind: PersistentVolumeClaim metadata: name: romm namespace: games spec: accessModes: [ReadWriteOnce] storageClassName: ceph-block dataSource: name: romm-data-migration kind: VolumeSnapshot apiGroup: snapshot.storage.k8s.io resources: requests: storage: 5Gi EOF Step 3: Update HelmRelease to use new PVC name\n1 2 3 4 # kubernetes/apps/games/romm/app/helmrelease.yaml persistence: data: existingClaim: romm # Changed from romm-data Step 4: Delete old PVC and resume\n1 2 3 kubectl delete pvc romm-data -n games kubectl delete volumesnapshot romm-data-migration -n games flux resume kustomization romm -n games The Gotcha: Conflicting dataSourceRef After the PVC migration, Flux complained about a dry-run failure. The manually-created PVC had dataSource: VolumeSnapshot, but the volsync template wanted dataSourceRef: ReplicationDestination. 
These are immutable.\nThe fix: delete the PVC and let the volsync template recreate it from a cloud restore:\n1 2 3 4 5 flux suspend kustomization romm -n games kubectl scale deploy romm -n games --replicas=0 kubectl delete pvc romm -n games flux resume kustomization romm -n games # The ReplicationDestination triggers a restore from R2 This works because we kept the Restic backups running throughout the migration.\nLessons Learned This migration surfaced several bad assumptions:\nAssumption Reality Impact MutatingAdmissionPolicy feature gate \u0026ldquo;just works\u0026rdquo; Requires Talos patch for apiServer extraArgs + runtime-config Had to create patch, regenerate configs, roll out to all nodes K8s 1.33 uses MutatingAdmissionPolicy v1beta1 Uses v1alpha1 (v1beta1 is K8s 1.34+) API server crash, had to fix and reapply ClusterSecretStore named onepassword Named onepassword-connect ExternalSecrets failed to sync app-template creates ConfigMap as kopia-config Creates as kopia Pod stuck in ContainerCreating Kopia repository pre-exists NFS path was empty Kopia server crashed on startup Reference patterns from docs were tested They were aspirational Multiple fixes needed Can rename PVC by creating new one from snapshot Works, but dataSource is immutable Had to delete and restore from cloud PVC dataSourceRef can be patched PVC spec is immutable after creation Kustomization dry-run failures VolumeSnapshotClass named csi-ceph-blockpool Named csi-ceph-block Snapshots failed to create Gateway named envoy-internal Named internal (in network namespace) HTTPRoute not attached to gateway Routes auto-create DNS records Need internal-dns.alpha.kubernetes.io/target annotation Hostname didn\u0026rsquo;t resolve Kopia Web UI works behind reverse proxy CSRF token validation fails Had to add --disable-csrf-token-checks Default UID 1000 is fine Should use 568 to match NFS share permissions Permission issues on NFS Can use hardcoded timezone Should use ${TIMEZONE} variable Inconsistent 
with cluster conventions Verification Commands Before implementing, always check:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # API version support kubectl api-resources | grep -i mutating # Existing resource names kubectl get clustersecretstore kubectl get volumesnapshotclass kubectl get storageclass kubectl get gateway -n network # Chart-generated resources helm template \u0026lt;release\u0026gt; | grep -i configmap # After migration, verify backup works kubectl get replicationsource \u0026lt;app\u0026gt; -n \u0026lt;namespace\u0026gt; kubectl exec -n storage deployment/kopia -- kopia snapshot list --all Step 6: The Successful Migration After fixing all the issues, migrating romm to Kopia was straightforward:\n1. Update the Kustomization to use the component:\n1 2 3 4 5 6 7 8 9 # kubernetes/apps/games/romm/app/kustomization.yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization resources: - ./externalsecret.yaml - ./helmrelease.yaml - ../../../../templates/gatus/external components: - ../../../../components/volsync # All 3 destinations: NFS + B2 + R2 2. Set the required variables in ks.yaml:\n1 2 3 4 5 # kubernetes/apps/games/romm/ks.yaml postBuild: substitute: APP: romm VOLSYNC_CAPACITY: 5Gi That\u0026rsquo;s it! The VOLSYNC_UID and VOLSYNC_GID default to 568, which matches most apps. You only need to specify them if your app uses a different UID/GID:\n1 2 3 4 5 6 7 # Only if your app uses a non-standard UID/GID postBuild: substitute: APP: myapp VOLSYNC_CAPACITY: 10Gi VOLSYNC_UID: \u0026#34;1000\u0026#34; # Override the default 568 VOLSYNC_GID: \u0026#34;1000\u0026#34; 3. Commit, push, and reconcile:\n1 2 3 git add . \u0026amp;\u0026amp; git commit -m \u0026#34;feat(romm): switch volsync from Restic to Kopia\u0026#34; git push flux reconcile kustomization romm -n games 4. 
Verify the backups:\n1 2 3 4 5 6 7 8 9 10 11 # Check all three ReplicationSources were created kubectl get replicationsource -n games # NAME SOURCE LAST SYNC DURATION NEXT SYNC # romm romm 2025-12-06T00:39:46Z 3s 2025-12-06T01:00:00Z # romm-b2 romm 2025-12-06T00:00:00Z 45s 2025-12-07T00:00:00Z # romm-r2 romm 2025-12-06T00:00:00Z 48s 2025-12-07T00:00:00Z # Check snapshots in Kopia Web UI or via CLI kubectl exec -n storage deployment/kopia -- kopia snapshot list --all # romm@games:/data # 2025-12-06 13:39:46 NZDT k... 156.3 MB files:5582 dirs:285 The local NFS backup completed in 3 seconds because Kopia\u0026rsquo;s deduplication recognized the existing data. Cloud backups take longer but run daily for disaster recovery.\nUnderstanding Kopia\u0026rsquo;s Storage When I first looked at the NFS share after the migration, I was confused:\n1 2 3 4 5 6 7 8 9 10 11 VolsyncKopia ls -la drwxrwx--- 28 apps apps 32 Dec 6 13:35 . -rwxrwx--- 1 apps apps 43 Dec 6 08:07 .shards drwxrwx--- 3 apps apps 3 Dec 6 08:30 _lo -rwxrwx--- 1 apps apps 30 Dec 6 08:17 kopia.blobcfg.f -rwxrwx--- 1 apps apps 1117 Dec 6 12:31 kopia.maintenance.f -rwxrwx--- 1 apps apps 1101 Dec 6 08:17 kopia.repository.f drwxrwx--- 3 apps apps 3 Dec 6 12:31 p03 drwxrwx--- 3 apps apps 3 Dec 6 12:31 p1e drwxrwx--- 3 apps apps 3 Dec 6 12:31 q10 ... Where\u0026rsquo;s the romm folder? This is content-addressable storage - Kopia doesn\u0026rsquo;t store data by source name. Instead:\nData is deduplicated and compressed into pack blobs (the p*, q*, s* folders) Blob names are based on content hashes, not source names All apps share the same deduplication pool To see the logical structure, use kopia snapshot list --all or the Web UI This means if romm and another app have identical files, they\u0026rsquo;re only stored once. The tradeoff is you can\u0026rsquo;t browse the repository directly on the NAS - you need Kopia tools.\nImportant: Never manually delete files from the repository. 
Kopia uses garbage collection during maintenance to clean up unreferenced blobs safely.\nWhat\u0026rsquo;s Next The Kopia infrastructure is deployed and working. Romm is now successfully backing up to all three destinations:\nNFS (hourly) - Fast local restores Backblaze B2 (daily) - Off-site disaster recovery Cloudflare R2 (daily) - Additional cloud redundancy The next steps:\nromm (games) - Migrated to Kopia Done! Downloads namespace - qbittorrent, radarr, sonarr, etc. Entertainment namespace - plex, jellyfin, tautulli Home automation - home-assistant, zigbee2mqtt The key lesson: keep existing backups running during migration. Don\u0026rsquo;t switch to the new backup system until you\u0026rsquo;ve verified the PVC naming is correct and the app is stable. Having cloud backups as a safety net saved me from data loss multiple times during this migration.\nSummary Component Purpose MutatingAdmissionPolicy Auto-inject NFS volume into Volsync mover pods Kopia server Web UI for browsing/managing NFS backups components/volsync Root component that includes all 3 destinations components/volsync/nfs-truenas Primary backup (hourly) with restore capability components/volsync/s3-backblaze Disaster recovery to B2 (daily) components/volsync/s3-cloudflare Disaster recovery to R2 (daily) The migration from Restic to Kopia took longer than expected due to API version mismatches and incorrect assumptions about resource names. But the end result—a 3-2-1 backup strategy with local NFS for fast restores and dual cloud destinations for disaster recovery—is worth the effort. No more Restic lock issues!\nThis post documents part of the ongoing work on my home-ops repository. 
The patterns here are adapted from the excellent home-operations community repos.\n","date":"2025-12-06T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/volsync-kopia-migration/","title":"Migrating Volsync from Restic to Kopia"},{"content":" \u0026ldquo;When your Kustomizations all live in flux-system, cross-namespace dependencies become a tangled mess of implicit assumptions.\u0026rdquo;\nThe Problem with Everything in flux-system If you\u0026rsquo;ve run a Flux-managed Kubernetes cluster for any length of time, you\u0026rsquo;ve probably inherited (or created) the pattern where every Flux Kustomization CR lives in the flux-system namespace. It looks something like this:\n1 2 3 4 5 6 7 8 9 10 11 12 # kubernetes/apps/database/mosquitto/ks.yaml apiVersion: kustomize.toolkit.fluxcd.io/v1 kind: Kustomization metadata: name: mosquitto namespace: flux-system # Everything lives here spec: targetNamespace: database # But deploys resources here dependsOn: - name: external-secrets-stores # No namespace needed - it\u0026#39;s also in flux-system path: ./kubernetes/apps/database/mosquitto/app # ... This works, but it has problems:\nHidden coupling: Every dependsOn implicitly assumes the dependency is in flux-system. When you read the manifest, you can\u0026rsquo;t tell where resources actually live.\nCrowded namespace: Running flux get kustomizations dumps 80+ resources into one list. Finding the one that\u0026rsquo;s failing means scrolling through walls of text.\nNamespace isolation is fake: Your workloads deploy to separate namespaces, but their reconciliation state all lives in one place. RBAC, network policies, and observability tools can\u0026rsquo;t easily scope to \u0026ldquo;just the database apps.\u0026rdquo;\nThe substituteFrom trap: Flux\u0026rsquo;s variable substitution pulls ConfigMaps and Secrets from the Kustomization\u0026rsquo;s namespace. 
If your Kustomization is in flux-system but your app namespace has its own secrets, you need explicit cross-namespace references everywhere.\nThe Goal: Kustomizations in Their Target Namespaces The fix is straightforward in principle: move each Flux Kustomization into the namespace it actually manages. The result:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 # kubernetes/apps/database/mosquitto/ks.yaml apiVersion: kustomize.toolkit.fluxcd.io/v1 kind: Kustomization metadata: name: \u0026amp;app mosquitto namespace: \u0026amp;namespace database # Lives where it deploys spec: targetNamespace: *namespace dependsOn: - name: external-secrets-stores namespace: external-secrets # Explicit cross-namespace reference sourceRef: kind: GitRepository name: flux-system namespace: flux-system # Git source is still in flux-system # ... Now kubectl get kustomizations -n database shows only database-related reconcilers. Dependencies are explicit. The mental model matches the deployment model.\nThe Challenge: substituteFrom and SOPS Decryption Here\u0026rsquo;s where it gets interesting. Flux Kustomizations support postBuild.substituteFrom to inject variables from ConfigMaps and Secrets:\n1 2 3 4 5 6 7 spec: postBuild: substituteFrom: - kind: ConfigMap name: cluster-settings - kind: Secret name: cluster-secrets The catch? From the Flux CRD documentation:\nName of the values referent. Should reside in the same namespace as the referring resource.\nWhen your Kustomization is in flux-system, it can reference cluster-settings and cluster-secrets which also live in flux-system. Move the Kustomization to database, and suddenly it can\u0026rsquo;t find those ConfigMaps anymore.\nThe same problem applies to SOPS decryption:\n1 2 3 4 5 spec: decryption: provider: sops secretRef: name: sops-age # Also needs to be in the same namespace You could solve this by copying cluster-settings, cluster-secrets, and sops-age into every namespace. 
But that defeats the purpose of having cluster-wide settings, and it\u0026rsquo;s a maintenance nightmare.\nThe Solution: Strategic Patching from cluster-apps The elegant solution is to use Flux\u0026rsquo;s patch capability at the parent Kustomization level. In my setup, cluster-apps is the top-level Kustomization that reconciles everything under kubernetes/apps/:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 # kubernetes/flux/cluster/ks.yaml apiVersion: kustomize.toolkit.fluxcd.io/v1 kind: Kustomization metadata: name: cluster-apps namespace: flux-system spec: path: ./kubernetes/apps prune: true sourceRef: kind: GitRepository name: flux-system postBuild: substituteFrom: - kind: ConfigMap name: cluster-settings - kind: Secret name: cluster-secrets patches: - patch: |- apiVersion: kustomize.toolkit.fluxcd.io/v1 kind: Kustomization metadata: name: _ spec: decryption: provider: sops secretRef: name: sops-age sourceRef: kind: GitRepository name: flux-system namespace: flux-system postBuild: substituteFrom: - kind: ConfigMap name: cluster-settings optional: true - kind: Secret name: cluster-secrets optional: true target: group: kustomize.toolkit.fluxcd.io kind: Kustomization This patch is applied to every child Kustomization that cluster-apps creates. 
The key insight: the patch adds namespace: flux-system to the sourceRef and injects the same decryption and substituteFrom references into every child, so child Kustomizations can live anywhere while still using the shared Git source. Note that substituteFrom has no namespace field: as the CRD documentation quoted above says, the referenced ConfigMap and Secret have to exist in the child\u0026rsquo;s own namespace.\nBreaking Down the Patch Let\u0026rsquo;s look at what this accomplishes:\nsourceRef.namespace: flux-system: Child Kustomizations reference the GitRepository in flux-system, regardless of where they live.\nsubstituteFrom with optional: true: References to cluster-settings and cluster-secrets are injected into every child. They resolve in the child\u0026rsquo;s own namespace, and if they don\u0026rsquo;t exist there, reconciliation continues instead of failing.\nSOPS decryption: The sops-age secret reference is injected, so encrypted secrets work in any namespace that holds a copy of that secret.\nname: _: This is a placeholder name. The target selector (group and kind) decides which resources the patch applies to, so it matches every child Kustomization.\nThe Migration Pattern With the patching in place, migrating each Kustomization follows a consistent pattern:\nBefore (in flux-system) apiVersion: kustomize.toolkit.fluxcd.io/v1 kind: Kustomization metadata: name: mosquitto namespace: flux-system spec: targetNamespace: database dependsOn: - name: dragonfly path: ./kubernetes/apps/database/mosquitto/app sourceRef: kind: GitRepository name: flux-system After (in target namespace) apiVersion: kustomize.toolkit.fluxcd.io/v1 kind: Kustomization metadata: name: \u0026amp;app mosquitto namespace: \u0026amp;namespace database spec: targetNamespace: *namespace dependsOn: - name: dragonfly namespace: database # Explicit namespace required path: ./kubernetes/apps/database/mosquitto/app sourceRef: kind: GitRepository name: flux-system namespace: flux-system # Cross-namespace reference Key Changes Add metadata.namespace: Point to the target namespace using a YAML anchor for DRY.\nAdd namespace to dependsOn: Every cross-namespace dependency needs an explicit namespace. 
Same-namespace dependencies can omit it, but I recommend always including it for clarity.\nAdd sourceRef.namespace: flux-system: The GitRepository stays in flux-system, so child Kustomizations need to reach across.\nReusable Components: The Common Pattern To reduce boilerplate, I created a shared component that every namespace includes:\n1 2 3 4 5 6 7 8 # kubernetes/components/common/kustomization.yaml apiVersion: kustomize.config.k8s.io/v1alpha1 kind: Component resources: - ./namespace.yaml - ./cluster-vars - ./alerts - ./sops Each namespace\u0026rsquo;s kustomization.yaml pulls this in:\n1 2 3 4 5 6 7 8 9 10 11 12 # kubernetes/apps/database/kustomization.yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization namespace: database components: - ../../components/common resources: - ./cloudnative-pg/ks.yaml - ./dragonfly/ks.yaml - ./mosquitto/ks.yaml This ensures every namespace gets:\nA properly-labeled Namespace resource Cluster-wide variables (ConfigMaps/Secrets) Flux alerts and providers SOPS external secrets Real-World Gotchas 1. DNS Resolution During Migration During my migration, I hit a chicken-and-egg problem. CoreDNS was configured to forward internal DNS to a cluster-internal DNS service, but when that service drifted or wasn\u0026rsquo;t ready, Flux couldn\u0026rsquo;t fetch from Git because DNS was broken.\nThe fix: simplify DNS architecture. I later removed k8s-gateway entirely and configured CoreDNS to forward to my UDM Pro (10.90.254.1), which has DNS records created by external-dns-unifi. This eliminates cluster-internal DNS dependencies during bootstrap.\n1 2 3 4 5 6 7 # kubernetes/apps/kube-system/coredns/app/helm-values.yaml servers: - zones: - zone: . plugins: - name: forward parameters: . /etc/resolv.conf # Forwards to UDM (10.90.254.1) 2. Rook-Ceph and Storage Dependencies Storage operators like Rook-Ceph are sensitive to manifest changes. 
Moving Kustomizations around can trigger reconciliation loops that confuse the operator about existing OSDs.\nMy approach: migrate storage-adjacent namespaces last, and be prepared to wipe and rebuild if Ceph gets confused (see my Talos DR Reset post for that adventure).\n3. The cluster-apps-* Naming Convention Some apps had legacy names like cluster-apps-rook-ceph-cluster. When migrating, I renamed them to just rook-ceph-cluster. This meant updating every dependsOn that referenced the old name.\nA grep through the codebase found all the references:\n1 grep -r \u0026#34;cluster-apps-\u0026#34; kubernetes/apps/ --include=\u0026#34;*.yaml\u0026#34; Validation After migrating each namespace, I validated with:\n1 2 3 4 5 6 7 8 # Check all Kustomizations in the namespace are Ready flux get kustomizations -n database # Force reconcile to ensure no cached state flux reconcile kustomization mosquitto -n database --force # Verify HelmReleases deployed correctly flux get helmreleases -n database For apps that weren\u0026rsquo;t deployed yet (commented out in kustomization.yaml), I verified the YAML was syntactically correct:\n1 kustomize build kubernetes/apps/cortex --load-restrictor=LoadRestrictionsNone The End Result After migrating all namespaces, my cluster has:\nClear namespace boundaries: kubectl get kustomizations -n downloads shows only download-related apps Explicit dependencies: No more guessing where a dependency lives Scoped observability: Prometheus can scrape per-namespace, dashboards can filter by namespace Simpler RBAC: Namespace-scoped roles can manage their own Flux resources The cluster-apps parent Kustomization still lives in flux-system (it has to - it\u0026rsquo;s the entry point), but everything it spawns now lives where it belongs.\nSummary Aspect Before After Kustomization location All in flux-system Each in target namespace dependsOn references Implicit (same namespace) Explicit with namespace sourceRef Implicit (same namespace) Explicit namespace: 
flux-system substituteFrom Direct reference Injected by the parent patch, resolved in each namespace Observability One giant list Namespaced views The migration took several sessions and touched 200+ files, but the result is a cleaner, more maintainable GitOps structure. If you\u0026rsquo;re running a homelab with Flux, I highly recommend making this change before your cluster grows any larger.\nThis post documents the migration I performed on my home-ops repository. The patterns here are heavily inspired by Kashalls\u0026rsquo; homelab repo and the broader Kubernetes@Home community.\n","date":"2025-12-05T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/flux-namespace-migration/","title":"Migrating Flux Kustomizations Out of flux-system"},{"content":" \u0026ldquo;The best infrastructure is the infrastructure you delete.\u0026rdquo; — Me, staring at 23 proxy pods\nThe Problem: Proxy Sprawl I\u0026rsquo;ve been running Tailscale in my homelab for remote access, using the Tailscale Operator\u0026rsquo;s Ingress feature. It works great — you add a tailscale ingress class to your app, and the operator spins up a proxy pod that appears in your Tailnet. Access it via MagicDNS (paperless.${TAILNET_DNS_NAME}) and you\u0026rsquo;re in.\nThe problem? I had 23 of these proxy pods:\nkubectl get pods -n network | grep ts- ts-filebrowser-l4k64-0 1/1 Running ts-homepage-w2j25-0 1/1 Running ts-paperless-r892j-0 1/1 Running ts-teslamate-wthbr-0 1/1 Running # ... many more Each app gets its own Tailscale device, its own WireGuard tunnel, its own memory footprint. And the URLs are ugly — paperless.${TAILNET_DNS_NAME} instead of just paperless.${SECRET_DOMAIN}.\nWhat I really wanted: type paperless.${SECRET_DOMAIN} from anywhere and have it Just Work. 
On the LAN, on Tailscale, wherever.\nThe Goal: Same URL Everywhere The dream:\nLocation URL Result LAN paperless.${SECRET_DOMAIN} Resolves to internal gateway, works Tailscale (remote) paperless.${SECRET_DOMAIN} Resolves to internal gateway via WireGuard, works Public internet paperless.${SECRET_DOMAIN} No access (internal-only app) One URL. Zero extra infrastructure. No per-app proxy pods.\nThe Solution: Split DNS + Connector The key insight: my internal gateway (10.90.3.202) is already reachable via Tailscale — I just need DNS queries to return that IP when I\u0026rsquo;m connected to the Tailnet.\nThis requires two pieces:\nTailscale Connector — A pod that advertises my cluster subnet (10.90.0.0/16) to the Tailnet Split DNS — Configure Tailscale to forward *.${SECRET_DOMAIN} queries to my UDM Pro (which has DNS records created by external-dns-unifi) How It Works 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 Remote Device (Tailscale connected) │ │ 1. Browser: paperless.${SECRET_DOMAIN} │ 2. OS DNS query ▼ ┌─────────────────┐ │ Tailscale Client│ 3. Intercepts DNS (Split DNS configured) │ │ for *.${SECRET_DOMAIN} └────────┬────────┘ │ │ 4. Forward DNS query via WireGuard tunnel ▼ ┌─────────────────┐ │ UDM Pro │ 5. Resolves paperless.${SECRET_DOMAIN} │ 10.90.254.1 │ Returns: 10.90.3.202 (internal gateway) └────────┬────────┘ (record created by external-dns-unifi) │ │ 6. Response travels back via WireGuard ▼ ┌─────────────────┐ │ Tailscale Client│ 7. Browser now knows IP: 10.90.3.202 └────────┬────────┘ │ │ 8. HTTPS request to 10.90.3.202 via WireGuard ▼ ┌─────────────────┐ │ Internal Gateway│ 9. Matches HTTPRoute, serves request │ 10.90.3.202 │ └─────────────────┘ The beauty: my internal gateway doesn\u0026rsquo;t know or care whether the request came from the LAN or through Tailscale. 
It\u0026rsquo;s just another client hitting 10.90.3.202.\nSetting Up the Connector First, I needed a subnet router so Tailscale clients can reach my cluster IPs. The Tailscale Operator makes this easy with the Connector CRD:\n1 2 3 4 5 6 7 8 9 10 11 # kubernetes/apps/network/tailscale/operator/app/connector.yaml --- apiVersion: tailscale.com/v1alpha1 kind: Connector metadata: name: home-subnet spec: hostname: home-subnet-router subnetRouter: advertiseRoutes: - 10.90.0.0/16 Add it to your kustomization and push:\n1 2 3 4 5 6 7 8 # kubernetes/apps/network/tailscale/operator/app/kustomization.yaml --- apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization resources: - ./externalsecret.yaml - ./helmrelease.yaml - ./connector.yaml After Flux reconciles, you\u0026rsquo;ll see a new pod and Tailscale device:\n1 2 3 kubectl get connector -n network NAME SUBNETROUTES STATUS AGE home-subnet 10.90.0.0/16 ConnectorCreated 5m Warning You need to approve the subnet routes in Tailscale Admin. Go to Machines, find home-subnet-router, and approve the 10.90.0.0/16 route.\nConfiguring Tailscale Split DNS Now the fun part. 
In the Tailscale Admin Console:\nNavigate to the DNS tab Under Nameservers, click Add nameserver → Custom\u0026hellip; Configure: Nameserver: 10.90.254.1 (your UDM Pro) Check Restrict to domain Domain: ${SECRET_DOMAIN} The dialog looks like this:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ┌─────────────────────────────────────────────────────────────────┐ │ Add nameserver [x] │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ Nameserver │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ 10.90.254.1 │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ ☑ Restrict to domain │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ ${SECRET_DOMAIN} │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ This nameserver will only be used for DNS queries matching │ │ *.${SECRET_DOMAIN} │ │ │ │ [Cancel] [Save] │ └─────────────────────────────────────────────────────────────────┘ I also enabled Override local DNS to ensure Tailscale\u0026rsquo;s DNS config takes precedence when connected.\nInfo Why UDM Pro instead of k8s-gateway? I previously used k8s-gateway for internal DNS, but later migrated to using external-dns-unifi which pushes DNS records directly to my UDM Pro. This simplifies the architecture — UDM serves the same DNS records to LAN clients, pods (via CoreDNS forwarding), and Tailscale clients. One source of truth for internal DNS.\nTesting It Works From a Tailscale-connected device (away from home):\n1 2 3 4 5 # Check DNS resolution dig paperless.${SECRET_DOMAIN} # Expected: # paperless.${SECRET_DOMAIN}. 0 IN A 10.90.3.202 From LAN (without Tailscale):\n1 2 3 4 dig paperless.${SECRET_DOMAIN} # Expected (via UDM internal DNS): # paperless.${SECRET_DOMAIN}. 0 IN A 10.90.3.202 Both return the same IP — my internal gateway. 
Same URL, same destination, regardless of where I am.\nThe Migration: Removing Tailscale Ingresses With Split DNS working, those 23 proxy pods became redundant. Time to delete them.\nBefore (per-app Tailscale proxy):\n1 2 3 4 5 6 7 # Each app had this ingress: tailscale: enabled: true className: tailscale hosts: - host: paperless # MagicDNS name only After (just the internal route):\n1 2 3 4 5 6 7 8 9 10 11 12 # Same internal route serves both LAN and Tailscale route: app: annotations: internal-dns.alpha.kubernetes.io/target: internal.${SECRET_DOMAIN} hostnames: - paperless.${SECRET_DOMAIN} parentRefs: - name: internal namespace: network # No ingress.tailscale block! The migration is straightforward:\nConfigure Split DNS in Tailscale admin (done above) Verify access works via paperless.${SECRET_DOMAIN} on Tailscale Remove the ingress.tailscale blocks from HelmReleases Clean up orphaned Tailscale devices in admin console The Numbers Metric Before (Ingress) After (Split DNS) Tailscale proxy pods 23 0 Tailscale devices 24 (proxies + connector) 1 (connector) New infrastructure - 1 Connector pod URLs to remember 23 MagicDNS names 0 (same as LAN) Why This Works Tailscale\u0026rsquo;s Split DNS feature intercepts DNS queries at the OS level. When I look up paperless.${SECRET_DOMAIN}:\nOn LAN: Query goes to my UDM Pro, which has internal DNS records (created by external-dns-unifi) pointing to 10.90.3.202 On Tailscale: Query is intercepted and forwarded through the WireGuard tunnel to 10.90.254.1 (UDM Pro), which returns 10.90.3.202 In both cases, the browser gets 10.90.3.202. The subsequent HTTPS request goes directly to the internal gateway — on LAN via the local network, on Tailscale via the WireGuard mesh.\nThe Connector\u0026rsquo;s subnet advertisement is what makes 10.90.3.202 reachable from Tailscale. 
Without it, DNS would resolve correctly but the connection would time out.\nLessons Learned Split DNS is the right pattern — Per-app proxies were solving the wrong problem. I didn\u0026rsquo;t need 23 WireGuard tunnels; I needed one subnet route and proper DNS.\nConnectors are underrated — The Tailscale Operator\u0026rsquo;s Connector CRD is incredibly simple. One YAML file, and suddenly your entire cluster subnet is on your Tailnet.\nSame URL everywhere matters — Having to remember paperless.${TAILNET_DNS_NAME} vs paperless.${SECRET_DOMAIN} was annoying. Now I just use the real URL regardless of where I am.\nDelete infrastructure when you can — Those 23 proxy pods weren\u0026rsquo;t free. They consumed memory, created noise in kubectl get pods, and cluttered my Tailscale device list. Sometimes the best optimization is removal.\nReferences Tailscale DNS Documentation — Official Split DNS guide What is Split DNS? — Conceptual overview Tailscale Subnet Routers — Making internal networks reachable Tailscale Kubernetes Operator — Connector CRD docs ","date":"2025-12-02T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/tailscale-split-dns/","title":"Killing 23 Tailscale Proxies with Split DNS"},{"content":" \u0026ldquo;Give a man a fish and he eats for a day. Give an AI access to your cluster via MCP and watch it debug your HelmReleases at 2am.\u0026rdquo; — Me, probably sleep-deprived\nWhat is MCP? Model Context Protocol (MCP) is Anthropic\u0026rsquo;s open standard for connecting AI assistants to external tools and data sources. Think of it as giving your AI a set of hands — instead of just chatting, it can actually do things: query your Prometheus metrics, check Flux reconciliation status, browse GitHub PRs, or even execute kubectl commands.\nIf you\u0026rsquo;re running Claude Code in VS Code (or the CLI), MCP servers let you extend what Claude can access. Instead of copy-pasting error logs into the chat, Claude can pull them directly. 
Instead of describing your cluster state, Claude can just\u0026hellip; look.\nThe Setup I run a Kubernetes homelab managed via GitOps (Flux + Talos), and I wanted Claude to have full visibility into everything without me having to be the middleman. After a few evenings of tinkering, I landed on 14 MCP servers that cover infrastructure, observability, databases, code, and more.\nHere\u0026rsquo;s what\u0026rsquo;s running:\nServer Purpose kubernetes Native k8s resource access flux GitOps status, reconciliation talos Talos Linux node management helm Chart inspection and values grafana Dashboards, datasources, alerts prometheus PromQL queries, metric discovery databases PostgreSQL + MariaDB queries github PRs, issues, code search shell Controlled command execution mermaid Diagram validation eraser Diagram creation (Eraser.io) firecrawl Web scraping and search cloudflare-docs CF documentation search repoql Semantic codebase queries The Configuration MCP servers are configured in a .mcp.json file. I keep mine in the root of my home-ops repo so it\u0026rsquo;s shared between VS Code and Claude Code CLI.\nInfo MCP does not expand ~/ in paths. Always use full absolute paths like /home/username/... instead of ~/... 
in your .mcp.json configuration.\nHere\u0026rsquo;s the full config:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 { \u0026#34;mcpServers\u0026#34;: { \u0026#34;kubernetes\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;npx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;kubernetes-mcp-server@latest\u0026#34;] }, \u0026#34;flux\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;flux-operator-mcp\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;serve\u0026#34;] }, \u0026#34;talos\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;/home/\u0026lt;user\u0026gt;/mcp/talos/.venv/bin/python\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;/home/\u0026lt;user\u0026gt;/mcp/talos/src/talos_mcp/server.py\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;TALOSCONFIG\u0026#34;: \u0026#34;/home/\u0026lt;user\u0026gt;/home-ops/talosconfig\u0026#34; } }, \u0026#34;helm\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;/home/\u0026lt;user\u0026gt;/mcp/mcp-helm\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-mode=stdio\u0026#34;] }, \u0026#34;grafana\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;/home/\u0026lt;user\u0026gt;/go/bin/mcp-grafana\u0026#34;, \u0026#34;env\u0026#34;: { \u0026#34;GRAFANA_URL\u0026#34;: \u0026#34;${GRAFANA_URL}\u0026#34;, \u0026#34;GRAFANA_SERVICE_ACCOUNT_TOKEN\u0026#34;: \u0026#34;${GRAFANA_SERVICE_ACCOUNT_TOKEN}\u0026#34; } }, \u0026#34;shell\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;uvx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;mcp-shell-server\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;ALLOW_COMMANDS\u0026#34;: \u0026#34;ls,cat,pwd,grep,wc,find,head,tail,sort,uniq,cut,tr,sed,awk,jq,yq,kubectl,flux,talosctl,task,git,docker,helm\u0026#34; } }, \u0026#34;mermaid\u0026#34;: { 
\u0026#34;command\u0026#34;: \u0026#34;npx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;@probelabs/maid-mcp\u0026#34;] }, \u0026#34;eraser\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;docker\u0026#34;, \u0026#34;args\u0026#34;: [ \u0026#34;run\u0026#34;, \u0026#34;-i\u0026#34;, \u0026#34;--rm\u0026#34;, \u0026#34;-e\u0026#34;, \u0026#34;ERASER_API_KEY=${ERASER_API_KEY}\u0026#34;, \u0026#34;eraser-mcp:claude\u0026#34; ] }, \u0026#34;github\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;/home/\u0026lt;user\u0026gt;/mcp/github-mcp-server\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;stdio\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;GITHUB_PERSONAL_ACCESS_TOKEN\u0026#34;: \u0026#34;${GITHUB_PERSONAL_ACCESS_TOKEN}\u0026#34; } }, \u0026#34;prometheus\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;npx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;prometheus-mcp@latest\u0026#34;, \u0026#34;stdio\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;PROMETHEUS_URL\u0026#34;: \u0026#34;${PROMETHEUS_URL}\u0026#34; } }, \u0026#34;databases\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;uvx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;database-mcp\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;DB_CONFIGS\u0026#34;: \u0026#34;${DB_CONFIGS}\u0026#34; } }, \u0026#34;firecrawl\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;npx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;firecrawl-mcp\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;FIRECRAWL_API_KEY\u0026#34;: \u0026#34;${FIRECRAWL_API_KEY}\u0026#34; } }, \u0026#34;cloudflare-docs\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;npx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;mcp-remote\u0026#34;, \u0026#34;https://docs.mcp.cloudflare.com/mcp\u0026#34;] } }, \u0026#34;servers\u0026#34;: { \u0026#34;repoql\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;stdio\u0026#34;, 
\u0026#34;command\u0026#34;: \u0026#34;/home/\u0026lt;user\u0026gt;/mcp/repoql\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;mcp\u0026#34;] } } } Environment Variables Sensitive values like API keys and tokens live in a ~/.secrets file that gets sourced by my shell:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 # ~/.secrets # API Keys export ERASER_API_KEY=your-eraser-api-key export FIRECRAWL_API_KEY=your-firecrawl-key # GitHub export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxxxx # Grafana export GRAFANA_URL=https://grafana.yourdomain.com export GRAFANA_SERVICE_ACCOUNT_TOKEN=glsa_xxxxx # Prometheus export PROMETHEUS_URL=https://prometheus.yourdomain.com # Database MCP Config export DB_CONFIGS=\u0026#39;[{\u0026#34;id\u0026#34;:\u0026#34;postgres\u0026#34;,\u0026#34;db_type\u0026#34;:\u0026#34;pg\u0026#34;,\u0026#34;configuration\u0026#34;:{\u0026#34;host\u0026#34;:\u0026#34;10.x.x.x\u0026#34;,\u0026#34;port\u0026#34;:5432,\u0026#34;user\u0026#34;:\u0026#34;postgres\u0026#34;,\u0026#34;password\u0026#34;:\u0026#34;yourpassword\u0026#34;,\u0026#34;dbname\u0026#34;:\u0026#34;postgres\u0026#34;},\u0026#34;description\u0026#34;:\u0026#34;CloudNative-PG\u0026#34;},{\u0026#34;id\u0026#34;:\u0026#34;mariadb\u0026#34;,\u0026#34;db_type\u0026#34;:\u0026#34;mysql\u0026#34;,\u0026#34;configuration\u0026#34;:{\u0026#34;host\u0026#34;:\u0026#34;10.x.x.x\u0026#34;,\u0026#34;port\u0026#34;:3306,\u0026#34;user\u0026#34;:\u0026#34;root\u0026#34;,\u0026#34;password\u0026#34;:\u0026#34;yourpassword\u0026#34;,\u0026#34;database\u0026#34;:\u0026#34;mysql\u0026#34;},\u0026#34;description\u0026#34;:\u0026#34;MariaDB\u0026#34;}]\u0026#39; Source this in your .zshrc or .bashrc:\n1 [ -f ~/.secrets ] \u0026amp;\u0026amp; source ~/.secrets Server-Specific Setup Grafana Service Account Grafana MCP needs a service account token. 
Here\u0026rsquo;s how to create one programmatically:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 # Port-forward to Grafana kubectl port-forward -n observability svc/grafana 3000:80 \u0026amp; # Get admin credentials kubectl get secret -n observability grafana-admin-secret -o jsonpath=\u0026#39;{.data.GF_SECURITY_ADMIN_USER}\u0026#39; | base64 -d kubectl get secret -n observability grafana-admin-secret -o jsonpath=\u0026#39;{.data.GF_SECURITY_ADMIN_PASSWORD}\u0026#39; | base64 -d # Create service account with Viewer role curl -X POST http://localhost:3000/api/serviceaccounts \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -u \u0026#34;admin:yourpassword\u0026#34; \\ -d \u0026#39;{\u0026#34;name\u0026#34;: \u0026#34;mcp-grafana-reader\u0026#34;, \u0026#34;role\u0026#34;: \u0026#34;Viewer\u0026#34;}\u0026#39; # Generate token (replace SERVICE_ACCOUNT_ID with the ID from above) curl -X POST http://localhost:3000/api/serviceaccounts/SERVICE_ACCOUNT_ID/tokens \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -u \u0026#34;admin:yourpassword\u0026#34; \\ -d \u0026#39;{\u0026#34;name\u0026#34;: \u0026#34;mcp-token\u0026#34;}\u0026#39; Save the token to your ~/.secrets file.\nEraser (Docker) The Eraser MCP server isn\u0026rsquo;t on npm, so I built it locally as a Docker container:\n1 2 cd /home/\u0026lt;user\u0026gt;/mcp/eraser docker build -t eraser-mcp:claude . The config references this local image and passes the API key via environment variable.\nDatabases The database-mcp package expects a DB_CONFIGS environment variable containing JSON configuration. Important: it only supports pg (PostgreSQL), mysql (MariaDB/MySQL), mssql, bigquery, oracle, and sqlite. Redis/DragonflyDB are not supported.\nTesting Your Setup After configuring everything, restart VS Code and check the MCP servers are loading. 
You can test individual servers:\n1 2 3 4 5 6 7 # In Claude Code or VS Code with Claude extension # Just ask Claude to use the tools: \u0026#34;List all Flux Kustomizations\u0026#34; \u0026#34;Query Prometheus for node memory usage\u0026#34; \u0026#34;Show me the Grafana datasources\u0026#34; \u0026#34;List tables in the postgres database\u0026#34; If a server fails to load, check the VS Code Output panel (View → Output → select \u0026ldquo;Claude\u0026rdquo; from dropdown) for error messages.\nThe Journey: Challenges and Fixes What follows is the debugging saga. If you just wanted the guide, you\u0026rsquo;re done — go forth and MCP. But if you\u0026rsquo;re troubleshooting issues or just enjoy watching someone else suffer through config errors, read on.\nGrafana: 401 Unauthorized Symptom: Grafana MCP failed with 401 Unauthorized\nCause: No authentication configured. The MCP server needs either basic auth or a service account token.\nFix: Created a service account with Viewer role (see setup above). The token goes in GRAFANA_SERVICE_ACCOUNT_TOKEN.\nTalos: \u0026ldquo;No Talos configuration loaded\u0026rdquo; Symptom: All Talos MCP tools returned \u0026ldquo;No Talos configuration loaded\u0026rdquo;\nCause: Bug in the Talos MCP server — it wasn\u0026rsquo;t reading the TALOSCONFIG environment variable. The code instantiated TalosClient() without passing the config path.\nFix: Modified /home/\u0026lt;user\u0026gt;/mcp/talos/src/talos_mcp/server.py line 115:\n1 2 3 4 5 # Before talos_client = TalosClient() # After talos_client = TalosClient(config_path=os.environ.get(\u0026#34;TALOSCONFIG\u0026#34;)) I submitted a PR for this fix: https://github.com/ry-ops/talos-mcp-server/pull/1\nDatabases: \u0026ldquo;Unsupported database type: redis\u0026rdquo; Symptom: Database MCP failed to start entirely\nCause: I had DragonflyDB (Redis-compatible) in my DB_CONFIGS with \u0026quot;db_type\u0026quot;:\u0026quot;redis\u0026quot;. 
The database-mcp package doesn\u0026rsquo;t support Redis.\nFix: Removed the DragonflyDB entry from DB_CONFIGS. Only PostgreSQL and MariaDB remain.\nDatabases: Environment Variable Expansion Symptom: Database passwords weren\u0026rsquo;t being used — connection failures\nCause: I had nested environment variables in .mcp.json:\n1 \u0026#34;DB_CONFIGS\u0026#34;: \u0026#34;[{...\\\u0026#34;password\\\u0026#34;:\\\u0026#34;${CNPG_PASSWORD}\\\u0026#34;...}]\u0026#34; The shell doesn\u0026rsquo;t expand ${CNPG_PASSWORD} inside a JSON string inside an env var.\nFix: Moved the entire DB_CONFIGS JSON (with passwords pre-embedded) to ~/.secrets. The .mcp.json just references ${DB_CONFIGS} which expands to the full JSON.\nMermaid: Package Name Change Symptom: Mermaid MCP failed with \u0026ldquo;Connection closed\u0026rdquo;\nCause: Package moved from @probelabs/maid mcp to @probelabs/maid-mcp\nFix: Updated the args in .mcp.json:\n1 \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;@probelabs/maid-mcp\u0026#34;] Eraser: npm 404 Symptom: npm error 404 Not Found - eraser-io-mcp-server\nCause: The package eraser-io-mcp-server doesn\u0026rsquo;t exist on npm. The only Eraser MCP is a GitHub repo that needs to be built locally.\nFix: Cloned the repo, built a Docker image, and configured MCP to run it via Docker:\n1 2 3 4 \u0026#34;eraser\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;docker\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;run\u0026#34;, \u0026#34;-i\u0026#34;, \u0026#34;--rm\u0026#34;, \u0026#34;-e\u0026#34;, \u0026#34;ERASER_API_KEY=${ERASER_API_KEY}\u0026#34;, \u0026#34;eraser-mcp:claude\u0026#34;] } Filesystem: Redundant Symptom: Filesystem MCP showed \u0026ldquo;Failed to fetch tools\u0026rdquo;\nCause: The filesystem MCP requires allowed directories to be specified in args. Without them, it has no permissions and exposes no tools.\nResolution: Removed it entirely. 
The shell MCP already handles file operations via cat, ls, etc., plus gives access to kubectl/flux/talosctl. Filesystem MCP was redundant.\nContext Usage: The Cost of Power One thing worth knowing: MCP tools consume tokens. Each tool definition takes up space in Claude\u0026rsquo;s context window, and with 14 servers loaded, it adds up fast.\nYou can check your context usage anytime by running /context in Claude Code:\n1 2 3 4 5 6 7 8 9 10 11 12 13 Context Usage Model: claude-opus-4-5-20251101 Tokens: 199.6k / 200.0k (100%) Categories Category Tokens Percentage System prompt 3.0k 1.5% System tools 13.9k 6.9% MCP tools 137.3k 68.7% Memory files 460 0.2% Messages 8 0.0% Free space 375 0.2% Autocompact buffer 45.0k 22.5% Yeah, 137k tokens just for MCP tool definitions. That\u0026rsquo;s 68% of the context window before Claude even reads a single file or responds to a message.\nSome of the heavier servers by token count:\nServer Tools ~Tokens grafana 55 tools ~42k github 42 tools ~30k flux 17 tools ~11k kubernetes 21 tools ~14k repoql 3 tools ~4.5k firecrawl 6 tools ~8.5k If you\u0026rsquo;re hitting context limits, consider disabling servers you don\u0026rsquo;t actively need. The shell MCP alone (716 tokens) gives you kubectl/flux/talosctl access — you might not need the dedicated kubernetes MCP (14k tokens) for basic queries.\nWrapping Up 14 MCP servers later, Claude can now:\nCheck my Flux reconciliation status Query Prometheus metrics List Grafana dashboards and datasources Run SQL queries against my databases Create diagrams on Eraser.io Execute controlled shell commands Search and read GitHub repos Is it overkill? Probably. Is it fun watching an AI debug my cluster at 2am while I drink coffee? 
Absolutely.\nThe config is in my home-ops repo if you want to steal it.\n","date":"2025-11-26T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/mcp-servers-vscode/","title":"Running 14 MCP Servers in VS Code for Homelab Mastery"},{"content":" \u0026ldquo;When in doubt: blkdiscard, talhelper, a good taskfile, and a mug of coffee\u0026hellip;or in my case Musashi Energy - lemonade\u0026rdquo;\nWhy I nuked the cluster I wanted to simplify how Flux applies applications. Historically the top-level Flux Kustomization for every workload lived inside flux-system, which meant lots of cross-namespace references and annoyingly long paths. The plan was to move each app\u0026rsquo;s Flux Kustomization into the namespace it actually manages so that kubernetes/apps/\u0026lt;namespace\u0026gt;/\u0026lt;app\u0026gt;/ks.yaml is the single source of truth.\nThat migration touched everything that rolls pods, including storage. Rook-Ceph saw the new manifests, decided that it needed to reconcile, and immediately tripped over the existing OSD data on disk. The net result: the operator kept trying to reformat disks that still held live block devices, the rollout failed in a messy state, and the cluster stopped scheduling anything critical. At that point it was faster (and safer) to pave the nodes and start clean than to coax Rook back to health while half the controllers were stuck.\nInfo I\u0026rsquo;m assuming you are also doing DR, so you will already have the CLI tools installed.\nWiping the nodes I keep a USB stick with ubuntu-24.04.3-desktop-amd64.iso around for this exact situation. Live-booting into \u0026ldquo;Try Ubuntu\u0026rdquo; gives me a predictable environment, modern nvme-cli, and a desktop for sanity checks. 
On each node (stanton-01/02/03) I confirmed the disks with lsblk:\nlsblk\nMy Talos install lives on two NVMe devices:\nnvme1n1 (990 GB): the Talos OS disk, where OpenEBS hostpath PVCs live.\nnvme0n1 (1.75 TB): the dedicated Rook-Ceph disk, which runs on a Thunderbolt ring network.\nOnce confirmed, I discarded every block on both drives and wiped the labels:\nsudo blkdiscard -f /dev/nvme0n1\nsudo blkdiscard -f /dev/nvme1n1\nsudo wipefs -a /dev/nvme0n1\nsudo wipefs -a /dev/nvme1n1\nWiping Disks\nThen I ran fdisk to take a look at the NVMe drives:\nsudo fdisk -l\nRunning fdisk -l\nblkdiscard is the important bit for Rook. It tells the SSD to discard every block, so the old partition data is really gone, which prevents Ceph from thinking an OSD is already provisioned the next time it boots. A final lsblk/fdisk -l confirmed that both drives were back to \u0026ldquo;disk only\u0026rdquo; with no partitions or filesystems.\nWhat we are looking for here is that neither disk has any partitions on it (which is what we see). After all the machines are wiped, we are ready to move on.\nPower down the machines, remove the Ubuntu media and prep the Talos media.\nI should note: I have a dedicated JetKVM per Talos node so I can easily view what the machine is doing from the browser on my desktop computer. No need for a keyboard, mouse or monitor on the actual machines. In addition, I run the JetKVM DC Power Control Extension so I can remotely power the machines on and off.\nRefreshing the Talos media All three nodes share the same Talos schematic (d009fe7b4f1bcd11a45d6ffd17e59921b0a33bc437eebb53cffb9a5b3b9e2992) which is baked into my existing talosconfig. 
That meant I could download the matching ISO straight from the factory and know it would have the right kernel modules, Thunderbolt NIC support, and SecureBoot bits:\n1 wget https://factory.talos.dev/image/d009fe7b4f1bcd11a45d6ffd17e59921b0a33bc437eebb53cffb9a5b3b9e2992/v1.11.5/metal-amd64.iso I still write it out with Rufus on a Windows machine because it\u0026rsquo;s the quickest way to prepare three bootable USB sticks. Live-boot Talos, point it at the controller VIP, and wait for maintenance mode.\nFor this I have 3 Identical Tesla 128GB USB Sticks (Two of which came out of our cars, the 3rd I grabbed from FB Marketplace)\ntalhelper, Task, and my bootstrap workflow Everything from this point lives in gavinmcfall/home-ops, which keeps Talos, Flux, and the apps in Git. Two files matter for bootstrap:\nkubernetes/bootstrap/talos/talconfig.yaml – the talhelper definition of the cluster (VIP 10.90.3.100, Thunderbolt cross-connects, bonded Intel NICs, etc.). .taskfiles/Talos/Taskfile.yaml – wraps the talhelper commands so I don\u0026rsquo;t fat-finger args. The talos:bootstrap task is essentially:\n1 2 3 4 5 6 7 8 9 task talos:bootstrap # expands to: talhelper gensecret \u0026gt; kubernetes/bootstrap/talos/talsecret.sops.yaml talhelper genconfig --config-file .../talconfig.yaml --secret-file .../talsecret.sops.yaml --out-dir .../clusterconfig talhelper gencommand apply --insecure | bash talhelper gencommand bootstrap | bash task talos:fetch-kubeconfig task talos:install-helm-apps talosctl health --server=false That installs Talos to all three NVMe devices, boots them into the control plane, then waits for etcd to converge. 
The follow-up install-helm-apps task runs kubernetes/bootstrap/helmfile.yaml, which deploys the CRDs Cilium needs, Cilium itself, CoreDNS, and Spegel so talosctl can start reporting node health correctly.\nBecause the Talos VIP and Thunderbolt-only links are described directly inside talconfig.yaml, there is nothing manual to do after the ISO is written. The KubePrism SAN, bonded NIC configuration, and node-specific Thunderbolt routes are reapplied automatically. That\u0026rsquo;s the magic of keeping talconfig and the cluster secrets in Git.\nDoing a dry run Before I actually pipe talhelper gencommand apply into bash, I like to rehearse the entire sequence so I know Talos will accept the configs:\n1 cd /home/gavin/home-ops/kubernetes/bootstrap/talos 1. Generate configs (safe, local files only) 1 talhelper genconfig --config-file talconfig.yaml --secret-file talsecret.sops.yaml --out-dir clusterconfig Output:\n1 2 3 4 5 generated config for stanton-01 in clusterconfig/home-kubernetes-stanton-01.yaml generated config for stanton-02 in clusterconfig/home-kubernetes-stanton-02.yaml generated config for stanton-03 in clusterconfig/home-kubernetes-stanton-03.yaml generated client config in clusterconfig/talosconfig generated .gitignore file in clusterconfig/.gitignore 2. Preview the apply/bootstrap commands 1 talhelper gencommand apply --extra-flags=\u0026#34;--insecure\u0026#34; --config-file talconfig.yaml --out-dir clusterconfig Output:\n1 2 3 talosctl apply-config --talosconfig=clusterconfig/talosconfig --nodes=10.90.3.101 --file=clusterconfig/home-kubernetes-stanton-01.yaml --insecure; talosctl apply-config --talosconfig=clusterconfig/talosconfig --nodes=10.90.3.102 --file=clusterconfig/home-kubernetes-stanton-02.yaml --insecure; talosctl apply-config --talosconfig=clusterconfig/talosconfig --nodes=10.90.3.103 --file=clusterconfig/home-kubernetes-stanton-03.yaml --insecure; 3. 
Preview bootstrap command 1 talhelper gencommand bootstrap --config-file talconfig.yaml --out-dir clusterconfig Output:\n1 talosctl bootstrap --talosconfig=clusterconfig/talosconfig --nodes=10.90.3.101; This is expected to only show one node, as Bootstrap initializes the cluster on a single node (stanton-01 in this case) and then the other nodes join it\n4. Dry-run each node 1 2 3 talosctl apply-config --insecure --nodes 10.90.3.101 --file clusterconfig/home-kubernetes-stanton-01.yaml --dry-run talosctl apply-config --insecure --nodes 10.90.3.102 --file clusterconfig/home-kubernetes-stanton-02.yaml --dry-run talosctl apply-config --insecure --nodes 10.90.3.103 --file clusterconfig/home-kubernetes-stanton-03.yaml --dry-run Output:\n1 2 Dry run summary: Node is running in maintenance mode and does not have a config yet. If like me you just swap the IP address when typing the command again and forget to swap the file name\u0026hellip; e.g. talosctl apply-config --insecure --nodes 10.90.3.102 --file clusterconfig/home-kubernetes-stanton-01.yaml --dry-run\nYou will get an error:\n1 2 3 4 talosctl apply-config --insecure --nodes 10.90.3.102 --file clusterconfig/home-kubernetes-stanton-01.yaml --dry-run error applying new configuration: rpc error: code = InvalidArgument desc = runtime configuration validation failed: 1 error occurred: * no disks matched the expression: glob(\u0026#34;S73VNU0X303066H\u0026#34;, disk.serial) \u0026amp;\u0026amp; disk.transport != \u0026#34;\u0026#34; \u0026amp;\u0026amp; !disk.readonly \u0026amp;\u0026amp; !disk.cdrom Make sure if you are going to be lazy, you are also accurate — the dry run is there to catch mistakes before Talos ever touches a disk.\ntalhelper genconfig prints which files were created, while the gencommand invocations show the exact talosctl commands that task talos:bootstrap is about to execute. 
The last step validates that each node is sitting in maintenance mode and that Talos can match the disk selector defined for that specific host. Once everything looks good it’s time to let the task rip.\nInstall Time #ifYouKnowYouKnow\nThe entire process took 6 minutes from the time I typed task talos:bootstrap to the time it was done. You will see connection errors and lots of other \u0026ldquo;scary\u0026rdquo; looking things; this is just because the process polls for things that are not \u0026ldquo;up\u0026rdquo; yet. Just wait, go make a drink, have a cookie.\nRunning task talos:bootstrap\ncd /home/gavin/home-ops\ntask talos:bootstrap\ntask runs every command defined in .taskfiles/Talos/Taskfile.yaml: it (re)generates Talos secrets if they’re missing, renders configs, applies them to each node, performs the one-time bootstrap on stanton-01, downloads the kubeconfig, and applies the bootstrap Helm apps (Cilium/CoreDNS/Spegel). If something fails midway I can rerun the task and it will pick up where it left off.\nOutput:\ntask talos:bootstrap task: [talos:bootstrap] if [ ! 
-f \u0026#34;/home/gavin/home-ops/kubernetes/bootstrap/talos/talsecret.sops.yaml\u0026#34; ]; then talhelper gensecret \u0026gt; /home/gavin/home-ops/kubernetes/bootstrap/talos/talsecret.sops.yaml sops --encrypt --in-place /home/gavin/home-ops/kubernetes/bootstrap/talos/talsecret.sops.yaml fi task: [talos:bootstrap] talhelper genconfig --config-file /home/gavin/home-ops/kubernetes/bootstrap/talos/talconfig.yaml --secret-file /home/gavin/home-ops/kubernetes/bootstrap/talos/talsecret.sops.yaml --out-dir /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig generated config for stanton-01 in /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig/home-kubernetes-stanton-01.yaml generated config for stanton-02 in /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig/home-kubernetes-stanton-02.yaml generated config for stanton-03 in /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig/home-kubernetes-stanton-03.yaml generated client config in /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig/talosconfig generated .gitignore file in /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig/.gitignore task: [talos:bootstrap] talhelper gencommand apply --extra-flags=\u0026#34;--insecure\u0026#34; --config-file /home/gavin/home-ops/kubernetes/bootstrap/talos/talconfig.yaml --out-dir /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig | bash task: [talos:bootstrap] until talhelper gencommand bootstrap --config-file /home/gavin/home-ops/kubernetes/bootstrap/talos/talconfig.yaml --out-dir /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig | bash; do sleep 10; done error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.102:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: 
Error while dialing: dial tcp 10.90.3.102:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.103:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.101:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.102:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.102:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.101:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.103:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.102:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.101:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while 
dialing: dial tcp 10.90.3.101:50000: connect: connection refused\u0026#34; error executing bootstrap: rpc error: code = Unavailable desc = last connection error: connection error: desc = \u0026#34;transport: Error while dialing: dial tcp 10.90.3.102:50000: connect: no route to host\u0026#34; task: [talos:fetch-kubeconfig] until talhelper gencommand kubeconfig --config-file /home/gavin/home-ops/kubernetes/bootstrap/talos/talconfig.yaml --out-dir /home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig --extra-flags=\u0026#34;/home/gavin/home-ops --force\u0026#34; | bash; do sleep 10; done task: [talos:install-helm-apps] until kubectl --kubeconfig /home/gavin/home-ops/kubeconfig wait --for=condition=Ready=False nodes --all --timeout=600s; do sleep 10; done Unable to connect to the server: dial tcp 10.90.3.100:6443: connect: no route to host The connection to the server 10.90.3.100:6443 was refused - did you specify the right host or port? The connection to the server 10.90.3.100:6443 was refused - did you specify the right host or port? The connection to the server 10.90.3.100:6443 was refused - did you specify the right host or port? 
error: no matching resources found error: no matching resources found error: no matching resources found error: no matching resources found error: no matching resources found error: no matching resources found node/stanton-01 condition met task: [talos:install-helm-apps] helmfile --kubeconfig /home/gavin/home-ops/kubeconfig --file /home/gavin/home-ops/kubernetes/bootstrap/helmfile.yaml apply --skip-diff-on-install --suppress-diff Adding repo cilium https://helm.cilium.io \u0026#34;cilium\u0026#34; has been added to your repositories Adding repo coredns https://coredns.github.io/helm \u0026#34;coredns\u0026#34; has been added to your repositories Pulling ghcr.io/prometheus-community/charts/prometheus-operator-crds:24.0.2 Pulling ghcr.io/spegel-org/helm-charts/spegel:0.5.1 Listing releases matching ^prometheus-operator-crds$ Listing releases matching ^coredns$ Listing releases matching ^spegel$ Listing releases matching ^cilium$ Upgrading release=prometheus-operator-crds, chart=/tmp/helmfile4191493076/observability/prometheus-operator-crds/prometheus-operator-crds/24.0.2/prometheus-operator-crds Release \u0026#34;prometheus-operator-crds\u0026#34; does not exist. Installing it now. NAME: prometheus-operator-crds LAST DEPLOYED: Wed Nov 19 10:22:06 2025 NAMESPACE: observability STATUS: deployed REVISION: 1 TEST SUITE: None Listing releases matching ^prometheus-operator-crds$ prometheus-operator-crds observability 1 2025-11-19 10:22:06.42745174 +1300 NZDT deployed prometheus-operator-crds-24.0.2 v0.86.2 Upgrading release=cilium, chart=cilium/cilium Release \u0026#34;cilium\u0026#34; does not exist. Installing it now. NAME: cilium LAST DEPLOYED: Wed Nov 19 10:22:10 2025 NAMESPACE: kube-system STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: You have successfully installed Cilium. Your release version is 1.18.3. 
For any further help, visit https://docs.cilium.io/en/v1.18/gettinghelp Listing releases matching ^cilium$ cilium kube-system 1 2025-11-19 10:22:10.403756432 +1300 NZDT deployed cilium-1.18.3 1.18.3 Upgrading release=coredns, chart=coredns/coredns Release \u0026#34;coredns\u0026#34; does not exist. Installing it now. NAME: coredns LAST DEPLOYED: Wed Nov 19 10:23:23 2025 NAMESPACE: kube-system STATUS: deployed REVISION: 1 TEST SUITE: None NOTES: CoreDNS is now running in the cluster as a cluster-service. It can be tested with the following: 1. Launch a Pod with DNS tools: kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools 2. Query the DNS server: / # host kubernetes Listing releases matching ^coredns$ coredns kube-system 1 2025-11-19 10:23:23.797590616 +1300 NZDT deployed coredns-1.45.0 1.13.1 Upgrading release=spegel, chart=/tmp/helmfile4191493076/kube-system/spegel/spegel/0.5.1/spegel Release \u0026#34;spegel\u0026#34; does not exist. Installing it now. NAME: spegel LAST DEPLOYED: Wed Nov 19 10:23:26 2025 NAMESPACE: kube-system STATUS: deployed REVISION: 1 TEST SUITE: None Listing releases matching ^spegel$ spegel kube-system 1 2025-11-19 10:23:26.188082749 +1300 NZDT deployed spegel-0.5.1 v0.5.1 UPDATED RELEASES: NAME NAMESPACE CHART VERSION DURATION prometheus-operator-crds observability oci://ghcr.io/prometheus-community/charts/prometheus-operator-crds 24.0.2 4s cilium kube-system cilium/cilium 1.18.3 1m13s coredns kube-system coredns/coredns 1.45.0 3s spegel kube-system oci://ghcr.io/spegel-org/helm-charts/spegel 0.5.1 46s task: [talos:install-helm-apps] until kubectl --kubeconfig /home/gavin/home-ops/kubeconfig wait --for=condition=Ready nodes --all --timeout=600s; do sleep 10; done node/stanton-01 condition met node/stanton-02 condition met node/stanton-03 condition met task: [talos:bootstrap] talosctl health --server=false waiting for etcd to be healthy: OK waiting for etcd members to be consistent across nodes: OK waiting for 
etcd members to be control plane nodes: OK waiting for apid to be ready: OK waiting for all nodes memory sizes: OK waiting for all nodes disk sizes: OK waiting for kubelet to be healthy: OK waiting for all nodes to finish boot sequence: OK waiting for all k8s nodes to report: OK waiting for all k8s nodes to report ready: OK waiting for all control plane static pods to be running: OK waiting for all control plane components to be ready: OK waiting for kube-proxy to report ready: SKIP waiting for coredns to report ready: OK waiting for all k8s nodes to report schedulable: OK Can I see and talk to the cluster? In your terminal, run:\n1 talosctl dashboard You can use your keyboard\u0026rsquo;s left and right arrows to cycle through the nodes. In addition I run k9s and can run this to see my cluster\n1 k9s The default page takes me to a view of all pods across all namespaces\nHolding Shift and pressing ; will type a : which opens the k9s menu. From here you can type nodes and press enter to see the nodes\nRehydrating GitOps state Before I do this next step I want to make sure the majority of my cluster is \u0026ldquo;Off\u0026rdquo; in Git. My repo layout is\nhome-ops/kubernetes/apps/\u0026lt;namespace\u0026gt;\nInside each namespace folder is a top-level kustomization.yaml e.g.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 --- apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization resources: # Pre Flux-Kustomizations - ./namespace.yaml # Flux-Kustomizations # - ./gatus/ks.yaml # - ./grafana/ks.yaml # - ./kromgo/ks.yaml # - ./kube-prometheus-stack/ks.yaml # - ./loki/ks.yaml # - ./network-ups-tools/ks.yaml # - ./prometheus-operator-crds/ks.yaml # - ./promtail/ks.yaml # - ./redisinsight/ks.yaml # - ./unpoller/ks.yaml # Exporters # - ./exporters You can see here I have commented out everything except the creation of the namespace. 
This means it won\u0026rsquo;t initially deploy this when I task flux:bootstrap\nI did this for every namespace except\ncert-manager external-secrets flux-system kube-system network openebs-system volsync-system Once Talos hands back a kubeconfig (stored at kubernetes/bootstrap/talos/clusterconfig/talosconfig), Flux gets bootstrapped with another task:\n1 task flux:bootstrap Output:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 task flux:bootstrap task: [flux:bootstrap] kubectl apply --kubeconfig /home/gavin/home-ops/kubeconfig --server-side --kustomize /home/gavin/home-ops/kubernetes/bootstrap/flux namespace/flux-system serverside-applied resourcequota/critical-pods-flux-system serverside-applied customresourcedefinition.apiextensions.k8s.io/alerts.notification.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/buckets.source.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/externalartifacts.source.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/gitrepositories.source.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/helmcharts.source.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/helmreleases.helm.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/helmrepositories.source.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/imagepolicies.image.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/imagerepositories.image.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/imageupdateautomations.image.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/kustomizations.kustomize.toolkit.fluxcd.io serverside-applied 
customresourcedefinition.apiextensions.k8s.io/ocirepositories.source.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/providers.notification.toolkit.fluxcd.io serverside-applied customresourcedefinition.apiextensions.k8s.io/receivers.notification.toolkit.fluxcd.io serverside-applied serviceaccount/helm-controller serverside-applied serviceaccount/image-automation-controller serverside-applied serviceaccount/image-reflector-controller serverside-applied serviceaccount/kustomize-controller serverside-applied serviceaccount/notification-controller serverside-applied serviceaccount/source-controller serverside-applied clusterrole.rbac.authorization.k8s.io/crd-controller-flux-system serverside-applied clusterrole.rbac.authorization.k8s.io/flux-edit-flux-system serverside-applied clusterrole.rbac.authorization.k8s.io/flux-view-flux-system serverside-applied clusterrolebinding.rbac.authorization.k8s.io/cluster-reconciler-flux-system serverside-applied clusterrolebinding.rbac.authorization.k8s.io/crd-controller-flux-system serverside-applied service/notification-controller serverside-applied service/source-controller serverside-applied service/webhook-receiver serverside-applied deployment.apps/helm-controller serverside-applied deployment.apps/image-automation-controller serverside-applied deployment.apps/image-reflector-controller serverside-applied deployment.apps/kustomize-controller serverside-applied deployment.apps/notification-controller serverside-applied deployment.apps/source-controller serverside-applied task: [flux:bootstrap] cat /home/gavin/home-ops/age.key | kubectl --kubeconfig /home/gavin/home-ops/kubeconfig -n flux-system create secret generic sops-age --from-file=age.agekey=/dev/stdin secret/sops-age created task: [flux:bootstrap] sops --decrypt /home/gavin/home-ops/kubernetes/flux/vars/cluster-secrets.sops.yaml | kubectl apply --kubeconfig /home/gavin/home-ops/kubeconfig --server-side --filename - secret/cluster-secrets 
serverside-applied task: [flux:bootstrap] kubectl apply --kubeconfig /home/gavin/home-ops/kubeconfig --server-side --filename /home/gavin/home-ops/kubernetes/flux/vars/cluster-settings.yaml configmap/cluster-settings serverside-applied task: [flux:bootstrap] kubectl apply --kubeconfig /home/gavin/home-ops/kubeconfig --server-side --kustomize /home/gavin/home-ops/kubernetes/flux/config kustomization.kustomize.toolkit.fluxcd.io/cluster serverside-applied kustomization.kustomize.toolkit.fluxcd.io/flux serverside-applied gitrepository.source.toolkit.fluxcd.io/home-kubernetes serverside-applied ocirepository.source.toolkit.fluxcd.io/flux-manifests serverside-applied That command:\nApplies the manifests under kubernetes/bootstrap/flux, including the deploy key secret. Decrypts and applies flux/vars/cluster-secrets.sops.yaml. Applies cluster settings and Kustomizations per namespace (kubernetes/apps/\u0026lt;namespace\u0026gt;/\u0026lt;app\u0026gt;). Disaster\u0026hellip;Kinda Because I am now cut over to Envoy Gateway and not ingress-nginx I am missing the Gateway CRDs because Cilium does not install them (I\u0026rsquo;m not using Cilium for Gateway API, I\u0026rsquo;m using Envoy)\nNow I could just kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml and solve the problem\u0026hellip; But that\u0026rsquo;s not IAC\nSo, I created a new folder home-ops/kubernetes/apps/network/gateway-api\n1 2 3 4 . ├── app │ └── kustomization.yaml └── ks.yaml That kustomization.yaml is fairly simple. 
Its job is to install the Gateway API CRDs:\n---\napiVersion: kustomize.config.k8s.io/v1beta1\nkind: Kustomization\nresources:\n  - https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml\nHowever, I need these installed before Envoy, so in kubernetes/apps/network/envoy-gateway/ks.yaml I had to add a dependsOn:\n---\n# yaml-language-server: $schema=https://kubernetes-schemas.ok8.sh/kustomize.toolkit.fluxcd.io/kustomization_v1.json\napiVersion: kustomize.toolkit.fluxcd.io/v1\nkind: Kustomization\nmetadata:\n  name: \u0026amp;app envoy-gateway\n  namespace: flux-system\nspec:\n  targetNamespace: \u0026amp;namespace network\n  commonMetadata:\n    labels:\n      app.kubernetes.io/name: *app\n  interval: 1h\n  path: ./kubernetes/apps/network/envoy-gateway/app\n  prune: true\n  sourceRef:\n    kind: GitRepository\n    name: home-kubernetes\n  timeout: 15m\n  dependsOn:\n    - name: gateway-api\nNow gateway-api comes up first, then Envoy.\nSometimes things come up a little out of order, especially where CRDs are concerned. The best bet is to check things and reconcile if needed. For example:\nenvoy-gateway came up (the main HelmRelease, which installed the CRDs) but the dry run failed for envoy-gateway-config.\nAll I had to do was:\nflux reconcile kustomization envoy-gateway-config --with-source\nThe Kustomization then reported an applied revision and the rest of Envoy rolled out.\nRook-Ceph DR Recovery: Why It Wouldn\u0026rsquo;t Start\nIssue 1: Cleanup Policy Blocking Orchestration\nSymptom: Operator logs showed:\nskipping orchestration for cluster object \u0026#34;rook-ceph\u0026#34; because its cleanup policy is set\nCause: The CephCluster spec had cleanupPolicy.confirmation: yes-really-destroy-data set. When this field is present, Rook interprets it as \u0026ldquo;this cluster is being deleted\u0026rdquo; and skips all orchestration entirely. 
Fix: Removed from helmrelease.yaml:\nconfirmation: yes-really-destroy-data wipeDevicesFromOtherClusters: true Since the HelmRelease was stuck reinstalling, we also patched directly:\n1 2 3 kubectl patch cephcluster -n rook-ceph rook-ceph --type=json \\ -p=\u0026#39;[{\u0026#34;op\u0026#34;: \u0026#34;remove\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;/spec/cleanupPolicy/confirmation\u0026#34;}, {\u0026#34;op\u0026#34;: \u0026#34;remove\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;/spec/cleanupPolicy/wipeDevicesFromOtherClusters\u0026#34;}]\u0026#39; Issue 2: OSD Network Configuration Failure Symptom: OSD pods in CrashLoopBackOff with:\n1 2 unable to find any IPv4 address in networks \u0026#39;169.254.255.0/24\u0026#39; interfaces \u0026#39;\u0026#39; Failed to pick cluster address. Cause: The Ceph cluster was configured to use 169.254.255.0/24 (Thunderbolt mesh) for cluster traffic, but the Thunderbolt interfaces weren\u0026rsquo;t up - the kernel showed timeout errors trying to communicate with the Thunderbolt controller. 
Fix:\nReplugged the Thunderbolt cables between nodes Verified interfaces came up with correct IPs (169.254.255.101/32) Temporarily patched cluster network to use main network while Thunderbolt was down: 1 2 kubectl patch cephcluster -n rook-ceph rook-ceph --type=json \\ -p=\u0026#39;[{\u0026#34;op\u0026#34;: \u0026#34;replace\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;/spec/network/addressRanges/cluster\u0026#34;, \u0026#34;value\u0026#34;: [\u0026#34;10.90.0.0/16\u0026#34;]}]\u0026#39; Lessons Learned cleanupPolicy.confirmation is a deletion flag - Only set it when you actually want to destroy the cluster Check physical connectivity - Thunderbolt timeouts in dmesg indicate cable/connection issues Network config must match available interfaces - OSDs can\u0026rsquo;t start if they can\u0026rsquo;t find an IP in the configured cluster network range The layout change that started this mess is still worth doing, so I\u0026rsquo;m re-introducing it carefully:\nEach namespace owns its Flux Kustomization (ks.yaml in the namespace folder). Reconciles happen namespace-by-namespace with task flux:apply path=rook-ceph/cluster ns=rook-ceph. Critical infrastructure (Talos bootstrap Helm releases + Flux) never reference manifests outside their namespace so I can fence blast-radius. 
Once Flux is healthy I let VolSync and CloudNativePG pull data back from object storage, which rebuilds PVCs long before Rook redeploys fresh OSDs.\nNext Steps Roll out Database Namespace My Database Namespace contains\nCloudnative PG (Postgres) Dragonfly DB (Redis) - Currently blocking the rollout of OAuth2Proxy which depends on this Maria DB (My SQL) Mosquitto (MQTT Message Broker) For this, I will edit the kustomization.yaml in the Database namespace to uncomment the paths\n1 2 3 4 5 6 7 8 9 10 11 12 --- # yaml-language-server: $schema=https://json.schemastore.org/kustomization apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization resources: # Pre Flux-Kustomizations - ./namespace.yaml # Flux-Kustomizations - ./cloudnative-pg/ks.yaml - ./dragonfly/ks.yaml - ./mosquitto/ks.yaml - ./mariadb/ks.yaml Postgres restore In Backblaze I have my Postgres Backups\nand in my Cloudnative Postgres cluster config I have my restore settings: kubernetes/apps/database/cloudnative-pg/cluster17/cluster17.yaml\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 backup: retentionPolicy: 60d barmanObjectStore: \u0026amp;barmanObjectStore data: compression: bzip2 wal: compression: bzip2 maxParallel: 8 destinationPath: s3://nerdz-cloudnative-pg/ endpointURL: https://s3.us-east-005.backblazeb2.com serverName: \u0026amp;currentCluster postgres17-v3 s3Credentials: accessKeyId: name: cloudnative-pg-secret key: aws-access-key-id secretAccessKey: name: cloudnative-pg-secret key: aws-secret-access-key # Note: externalClusters is needed when recovering from an existing cnpg cluster bootstrap: recovery: source: \u0026amp;previousCluster postgres17-v2 # Note: externalClusters is needed when recovering from an existing cnpg cluster externalClusters: - name: *previousCluster barmanObjectStore: \u0026lt;\u0026lt;: *barmanObjectStore serverName: *previousCluster This tells Postgres to grab the data from postgres17-v2 in backblaze and to build a new cluster called 
postgres17-v3 and sync that to backblaze.\nMariaDB: when charts won\u0026rsquo;t behave MariaDB should have been the easy part. The upgrade from Bitnami’s 22.x chart to 23.x surfaced two separate gotchas:\nBitnami yanked the old 12.0.2-debian-12-r0 image tags, so the kubelet couldn’t pull the default container anymore. The new chart force-enables FIPS templating and renders two env sections in the mysqld-exporter sidecar whenever metrics are on, which makes Flux’s post-render stage blow up before anything gets applied. How I pulled it back:\nTemporarily pinned the MariaDB, volume-permissions, and mysqld-exporter containers to known-good digests so the cluster stayed alive while I figured out 23.x. Added global.defaultFips: \u0026quot;off\u0026quot; plus explicit \u0026quot;off\u0026quot; overrides for primary, volumePermissions, and metrics FIPS blocks so Helm would render the new manifests. Disabled the chart’s built-in metrics and stood up my own exporter Deployment + Service + ServiceMonitor (now in kubernetes/apps/observability/exporters/mariadb-exporter). It still runs in the database namespace, reads mariadb-secret, and Prometheus keeps scraping through the ServiceMonitor. Once everything was committed, I suspended/resumed the HelmRelease to clear Flux’s stuck rollout and let the StatefulSet recreate cleanly. 
It’s not glamorous, but it gets MariaDB onto chart 23.2.4 without losing metrics, and every moving piece is now in Git so the next rebuild will just be another task flux:apply.\nContinuation - More namespaces coming online Now that MariaDB was sorted I moved on to enabling the next lot of namespaces\nObservability (exporters, gatus, grafana, kromgo, kube-prometheus-stack, loki, network-ups-tools, promtail, redis insights and unpoller) Then the security namespace (pocket-id) This allowed me to kick the oauth2proxy pods which were crashlooping and get them running again except\u0026hellip; :sob: oauth2-proxy Hairpin NAT and the Gateway API Version Maze With pocket-id running I expected oauth2-proxy to come up cleanly. Instead all three replicas crashed with OIDC discovery timeouts trying to reach id.${SECRET_DOMAIN}. The culprit: hairpin NAT. Pods on the same node as the external gateway couldn\u0026rsquo;t reach its LoadBalancer IP (10.90.3.201) and curl just hung.\nThe classic fix is split-horizon DNS—have CoreDNS forward queries for your domain to an internal resolver that returns ClusterIPs instead of the external gateway. I already had k8s-gateway deployed at 10.96.100.130, so I added a server block to CoreDNS:\n1 2 3 4 5 6 7 8 servers: - zones: - zone: ${SECRET_DOMAIN} scheme: dns:// port: 53 plugins: - name: forward parameters: . 10.96.100.130 That should have been the end of it, but k8s-gateway refused to serve queries:\n1 2 plugin/k8s_gateway: Could not sync required resources failed to list *v1alpha2.GRPCRoute k8s-gateway v0.4.0 hardcodes a watch on v1alpha2.GRPCRoute, but I was running Gateway API v1.2.0 which only ships v1.GRPCRoute. 
The experimental bundle doesn\u0026rsquo;t help—v1alpha2 was removed in v1.1.0+.\nThe fix required threading a needle between multiple version requirements:\nEnvoy Gateway v1.6.0 needs BackendTLSPolicy at API version v1, which only exists in Gateway API ≥v1.4.0 k8s-gateway v0.4.0 needs v1alpha2.GRPCRoute, which was removed in Gateway API v1.1.0+ The solution is to use Gateway API v1.4.0 experimental (for BackendTLSPolicy v1) and patch the GRPCRoute CRD to add v1alpha2 back. See the \u0026ldquo;Permanent Fix: JSON Patch for v1alpha2.GRPCRoute\u0026rdquo; section below for the full solution.\nBut I wasn\u0026rsquo;t done. k8s-gateway was only watching Ingress and Service resources. pocket-id uses Gateway API HTTPRoutes for ingress, so k8s-gateway had no idea id.${SECRET_DOMAIN} existed:\n1 2 # kubernetes/apps/network/k8s-gateway/app/helmrelease.yaml watchedResources: [\u0026#34;Ingress\u0026#34;, \u0026#34;Service\u0026#34;, \u0026#34;HTTPRoute\u0026#34;] One more gotcha: deleting and recreating the Gateway API CRDs wiped out all existing HTTPRoute resources, and Helm didn\u0026rsquo;t know to recreate them since the release checksum hadn\u0026rsquo;t changed. I had to manually reapply from the helm manifest:\n1 helm get manifest -n security pocket-id | kubectl apply -f - After all that, DNS finally resolved:\n1 2 3 $ nslookup id.${SECRET_DOMAIN} Name: id.${SECRET_DOMAIN} Address: 10.90.3.201 And oauth2-proxy came up healthy:\n1 2 3 4 NAME READY STATUS RESTARTS AGE oauth2-proxy-f7f87f84b-lhwx9 1/1 Running 0 28s oauth2-proxy-f7f87f84b-q4n7s 1/1 Running 0 28s oauth2-proxy-f7f87f84b-t5j6l 1/1 Running 0 28s Gateway API Version Hell: BackendTLSPolicy and WebSocket Drama Just when I thought I was done, external services started failing. TrueNAS and Unifi were returning 400 Bad Request errors on WebSocket connections. 
The UI would load, but any real-time features (shell, charts, live updates) were dead.\nThe Problem Envoy Gateway v1.6.0 expects BackendTLSPolicy at API version gateway.networking.k8s.io/v1, but Gateway API v1.1.0 experimental only ships v1alpha3. Meanwhile, k8s-gateway v0.4.0 hardcodes watching for v1alpha2.GRPCRoute which doesn\u0026rsquo;t exist in newer Gateway API releases.\nThe operator logs were clear:\n1 BackendTLSPolicy CRD not found, skipping BackendTLSPolicy watch This meant Envoy wasn\u0026rsquo;t applying TLS settings to the backends at all.\nThe Fix (and the Trap) Upgrading Gateway API to v1.4.0 experimental brought in BackendTLSPolicy v1 (see the full kustomization.yaml with the GRPCRoute JSON patch in the \u0026ldquo;Permanent Fix\u0026rdquo; section below).\nUpdated the BackendTLSPolicy resources from v1alpha3 to v1, removed the namespace field from targetRefs (v1 doesn\u0026rsquo;t have it), and removed sectionName to apply TLS to all service ports.\nBut then WebSockets broke. I spent an hour chasing the wrong fix—removing useClientProtocol: true thinking it was sending HTTP/2 to HTTP/1.1-only backends. Wrong move.\nThe original working config (pre-wipe) had:\n1 2 spec: useClientProtocol: true This setting is essential for WebSocket support. When a client initiates an HTTP/1.1 WebSocket upgrade, Envoy needs to forward that same protocol to the backend. Removing it caused Envoy to use its default protocol (HTTP/2 with ALPN), which the backends rejected.\nWebSocket support is handled by useClientProtocol: true in the BackendTrafficPolicy—it allows Envoy to forward the client\u0026rsquo;s HTTP/1.1 upgrade request using the same protocol.\nPermanent Fix: JSON Patch for v1alpha2.GRPCRoute After fighting this multiple times (every Flux reconciliation would remove the manually-applied CRD), I finally found a permanent solution. 
The problem is that k8s-gateway v0.4.0 hardcodes a watch on v1alpha2.GRPCRoute, but Gateway API v1.1.0+ only ships v1.GRPCRoute.\nYou can\u0026rsquo;t just add the old CRD as a separate resource—Kustomize will fail with \u0026ldquo;may not add resource with an already registered id\u0026rdquo; because both define the same CRD name.\nThe fix is to use a JSON patch to append v1alpha2 as an additional version to the existing GRPCRoute CRD:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 # kubernetes/apps/network/gateway-api/app/kustomization.yaml --- apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization resources: - https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml patches: # k8s-gateway v0.4.0 requires v1alpha2.GRPCRoute which was removed in Gateway API v1.1.0+ # Use JSON patch to ADD v1alpha2 version without replacing existing v1 - target: kind: CustomResourceDefinition name: grpcroutes.gateway.networking.k8s.io patch: |- - op: add path: /spec/versions/- value: name: v1alpha2 served: true storage: false schema: openAPIV3Schema: type: object properties: spec: type: object status: type: object This approach:\nPulls Gateway API v1.4.0 experimental (with v1.GRPCRoute and BackendTLSPolicy v1) Uses op: add with path: /spec/versions/- to append v1alpha2 to the versions array Keeps v1 as the storage version while serving v1alpha2 for k8s-gateway Now when Flux reconciles, k8s-gateway finds its required v1alpha2.GRPCRoute and stops complaining. No more manual patching after reconciliations.\nLessons learned:\nGateway API version compatibility is a minefield. k8s-gateway, Envoy Gateway, and external-dns all have different requirements. Pin to v1.4.0 experimental for BackendTLSPolicy v1 support with Envoy Gateway v1.6.0. Always add HTTPRoute to k8s-gateway\u0026rsquo;s watchedResources if you\u0026rsquo;re using Gateway API for ingress. CRD deletions cascade to all CR instances. 
After recreating CRDs, you need to force Helm to reapply resources even if the chart version hasn\u0026rsquo;t changed. Don\u0026rsquo;t remove useClientProtocol: true from BackendTrafficPolicy – it\u0026rsquo;s required for WebSocket connections to work properly. When in doubt, check the original working config before making changes. Use JSON patches to add CRD versions – when you need to support multiple API versions in a single CRD, use Kustomize JSON patches with op: add to append versions without replacing the original. Hairpin NAT and Cilium socketLB After all the Gateway API CRD drama, I hit another wall: pods couldn\u0026rsquo;t reach services on the external gateway. BookStack\u0026rsquo;s OIDC login to PocketID was timing out with cURL error 6: Could not resolve host: id.nerdz.cloud.\nThe symptoms were confusing:\nDNS resolved correctly via k8s-gateway (returning the LoadBalancer IP) The pod could reach the service directly via ClusterIP But any attempt to curl the LoadBalancer IP from inside the cluster timed out This is the classic hairpin NAT problem. With ingress-nginx, this worked because it runs with hostNetwork: true, so traffic never goes through the LoadBalancer service from inside the cluster. But Envoy Gateway deploys proxies as regular pods without hostNetwork, so internal traffic trying to reach the LoadBalancer IP gets stuck.\nThe fix? Enable Cilium\u0026rsquo;s socketLB. This allows pods to reach LoadBalancer IPs directly by doing socket-level interception:\n1 2 3 # kubernetes/apps/kube-system/cilium/app/helm-values.yaml socketLB: enabled: true Before this change, bpf-lb-sock was false in the cilium-config ConfigMap. After enabling it, pods can reach any LoadBalancer IP in the cluster, including the external gateway where PocketID lives.\nI also added an internal HTTPRoute to pocket-id as a belt-and-suspenders approach, so k8s-gateway returns both the internal and external gateway IPs. 
But the real fix was enabling socketLB.\nLesson learned: If you\u0026rsquo;re using Cilium with Envoy Gateway (or any ingress that doesn\u0026rsquo;t use hostNetwork), you need socketLB.enabled: true for pod-to-LoadBalancer connectivity.\nRook Ceph: clearing the “too many PGs per OSD” alert Even on the fresh Talos rebuild, Ceph immediately threw HEALTH_WARN too many PGs per OSD (265 \u0026gt; max 250). With only three NVMe-backed OSDs online, the default PG soft limit is tight, so any extra pools tip it over. The culprit was the bootstrap RGW realm (default) that Rook creates every time, even if you never intend to use it.\nWhat I did:\nConfirmed the warning was PG-related – kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status showed 265 PGs spread across 15 pools, all OSDs healthy.\nInspected pool usage – ceph df detail revealed default.rgw.(log|control|meta) sucking up 96 PGs despite having zero data.\nVerified the realm was unused – radosgw-admin realm/zonegroup/zone list only reported ceph-objectstore as active; the old default entries had no buckets or users (radosgw-admin bucket list --rgw-zone=default returned []).\nDeleted the bootstrap realm stack – removed the default zonegroup + zone, ran radosgw-admin period update --commit, then dropped the three empty pools with:\n1 2 3 ceph osd pool rm default.rgw.log default.rgw.log --yes-i-really-really-mean-it ceph osd pool rm default.rgw.control default.rgw.control --yes-i-really-really-mean-it ceph osd pool rm default.rgw.meta default.rgw.meta --yes-i-really-really-mean-it Rechecked cluster health – PG count fell to 169 and ceph status flipped back to HEALTH_OK.\nLesson learned: after every fresh Rook install, delete the unused default RGW realm or scale up OSDs before enabling your workloads. 
Otherwise Ceph wastes PGs on pools you don’t even mount, and you get an avoidable health warning the moment the cluster comes online.\nI\u0026rsquo;m still in the middle of rebuilding, but wiping the nodes and re-running the Talos + Flux bootstrap took under an hour once the ISO was ready. The longest part was downloading backups and letting VolSync rehydrate PVCs. If you ever reach the point where Rook is fighting ghost disks, don\u0026rsquo;t be afraid to pave the nodes. Talos makes the clean-room reinstall repeatable, and GitOps brings all of the workloads back on autopilot.\nDiscovering ImageVolume: OCI images as read-only config volumes While getting qbittorrent running, I noticed it was using something I didn\u0026rsquo;t know existed: the type: image persistence option in bjw-s\u0026rsquo;s app-template. It lets you mount an OCI image directly as a read-only volume—perfect for bundling binaries or config without maintaining separate ConfigMaps or init containers.\nThe qbittorrent helmrelease mounts qbrr (a torrent reannounce tool) this way:\n1 2 3 4 5 6 persistence: qbrr: type: image image: ghcr.io/buroa/qbrr:0.1.2@sha256:f930dbb4de49ffe3348d1d4f8187ce27590842bc4d6a89c3aa84234d7e99f46b globalMounts: - readOnly: true This pulls the OCI image and mounts its filesystem as a read-only volume—qbittorrent gets access to the qbrr binary without needing an init container to copy it over. The catch? It requires the Kubernetes ImageVolume feature gate, which is alpha and disabled by default.\nEnabling ImageVolume on Talos Enabling a Kubernetes feature gate on Talos means patching both the kubelet (on all nodes) and the API server (on control plane nodes). 
I followed MASTERBLASTER\u0026rsquo;s pattern from their home-ops repo.\nStep 1: Create the controller patch for the API server\nFile: kubernetes/bootstrap/talos/patches/controller/feature-gates.yaml\n1 2 3 4 cluster: apiServer: extraArgs: feature-gates: ImageVolume=true Step 2: Update the kubelet patch\nAdded to the existing kubernetes/bootstrap/talos/patches/global/kubelet.yaml:\n1 2 3 4 5 6 7 8 machine: kubelet: nodeIP: validSubnets: - 10.90.3.0/16 extraConfig: featureGates: ImageVolume: true Step 3: Reference the new patch in talconfig.yaml\nAdded \u0026quot;@./patches/controller/feature-gates.yaml\u0026quot; to the controlPlane patches list.\nRolling out the change safely Talos machineconfig changes can be risky—a bad config can brick your nodes. I took a cautious approach with full backups and validation at each step.\n1. Document current state\nFirst, verify no feature gates are currently configured:\n1 2 3 4 5 6 7 8 # Check kubelet feature gates on all nodes for node in 10.90.3.101 10.90.3.102 10.90.3.103; do echo \u0026#34;--- Node $node ---\u0026#34; talosctl -n $node get machineconfig -o yaml | grep -A2 featureGates done # Check API server feature gates kubectl get pod -n kube-system -l component=kube-apiserver -o yaml | grep feature-gates All nodes returned empty—no feature gates configured.\n2. Backup existing configs before changes\n1 2 3 4 5 6 7 8 cd /home/gavin/home-ops/kubernetes/bootstrap/talos # Copy current machineconfigs cp -r clusterconfig clusterconfig-before-imagevolume # Copy patch files we\u0026#39;re modifying cp patches/global/kubelet.yaml kubelet-before-imagevolume.yaml cp talconfig.yaml talconfig-before-imagevolume.yaml 3. 
Generate new configs and compare\n1 2 3 4 5 6 # Generate new machineconfigs talhelper genconfig # Compare old vs new (or use VS Code diff) diff clusterconfig-before-imagevolume/home-kubernetes-stanton-01.yaml \\ clusterconfig/home-kubernetes-stanton-01.yaml Here\u0026rsquo;s what the diffs look like in VS Code:\nKubelet patch – adds extraConfig.featureGates.ImageVolume: true:\nAPI server patch – adds extraArgs.feature-gates: ImageVolume=true:\n4. Dry-run first node\nTest the config without applying to catch any errors:\n1 2 export TALOSCONFIG=/home/gavin/home-ops/kubernetes/bootstrap/talos/clusterconfig/talosconfig talosctl apply-config -n 10.90.3.101 -f clusterconfig/home-kubernetes-stanton-01.yaml --dry-run The dry-run output showed exactly what would change:\n1 2 3 4 5 6 7 8 9 Dry run summary: Applied configuration without a reboot (skipped in dry-run). Config diff: + extraConfig: + featureGates: + ImageVolume: true ... + extraArgs: + feature-gates: ImageVolume=true 5. Apply one node at a time\nStart with the first node, wait for it to return to Ready, then proceed:\n1 2 3 4 5 6 7 8 # Apply to first node talosctl apply-config -n 10.90.3.101 -f clusterconfig/home-kubernetes-stanton-01.yaml # Watch for Ready status kubectl get nodes -w # Verify feature gate applied talosctl -n 10.90.3.101 get machineconfig -o yaml | grep -A2 featureGates 6. 
Apply to remaining nodes\nOnly after stanton-01 is confirmed healthy:\n1 2 3 4 5 6 7 # Second node talosctl apply-config -n 10.90.3.102 -f clusterconfig/home-kubernetes-stanton-02.yaml kubectl get nodes -w # Third node talosctl apply-config -n 10.90.3.103 -f clusterconfig/home-kubernetes-stanton-03.yaml kubectl get nodes -w All three nodes applied the config without requiring a reboot—Talos intelligently determines whether kubelet/API server changes need a full reboot or just a service restart.\nValidating the feature works With the feature gate enabled, I force-reconciled the qbittorrent helmrelease (the --force flag ensures Helm reapplies the deployment even if the chart version hasn\u0026rsquo;t changed):\n1 2 flux reconcile helmrelease -n downloads qbittorrent --force kubectl rollout status deployment/qbittorrent -n downloads --timeout=120s Once the pod came up, I verified the image volume was mounted and the binary was accessible:\n1 2 # Check the qbrr volume is mounted kubectl exec -n downloads deploy/qbittorrent -c app -- ls -la /qbrr/ Output shows the entire OCI image filesystem mounted read-only:\n1 2 3 4 5 6 7 total 2072 drwxr-xr-x 1 root root 6 Nov 20 12:03 . drwxr-xr-x 1 root root 40 Nov 20 12:03 .. drwxr-xr-x 2 root root 6 Aug 25 04:05 bin ... -rwxr-xr-x 1 root root 2118668 Nov 18 05:15 qbrr ... And verify the binary is executable:\n1 kubectl exec -n downloads deploy/qbittorrent -c app -- /qbrr/qbrr --help 1 2 3 4 5 6 7 8 9 Usage of /qbrr/qbrr: qBittorrent reannouncement tool -hash string Specific torrent hash to reannounce (single run mode) -interval int Interval between reannounce checks in seconds (default 7) ... The ImageVolume feature is working—the qbrr binary is available at /qbrr/qbrr without any init containers or volume copying.\nLesson learned: type: image persistence is a cleaner pattern for shipping binaries or config alongside your main container. You avoid init containers that copy files around, and the volume is immutable. 
The feature gate requirement means it\u0026rsquo;s not for everyone, but on a homelab where you control the cluster config, it\u0026rsquo;s worth enabling.\nGuardrails I\u0026rsquo;m putting in place Wipe before redeploying storage – running blkdiscard and wipefs is now part of the Taskfile talos:nuke flow so Ceph never trips on residual GPT headers again. Namespace-scoped Flux Kustomize – large-scale reorganizations happen behind feature branches and are reconciled one namespace at a time instead of flipping the entire cluster at once. Talos factory IDs in version control – keeping the Talos schematic IDs and Thunderbolt routes in talconfig.yaml meant the reinstall was deterministic. Future upgrades will keep those comments updated so I always know what ISO to pull. Document the scary steps – posts like this become my runbook. The next time Talos needs to be rebuilt I can follow the exact steps—Ubuntu live boot, blkdiscard, task talos:bootstrap, task flux:bootstrap—without searching through old Discord threads. Update 2025-11-24: Overseerr Connectivity Issues After Gateway API Migration After completing the Envoy Gateway migration, I discovered that Overseerr couldn\u0026rsquo;t communicate with any of the sonarr/radarr/plex instances. The logs showed consistent timeout errors:\n1 2 [error][Download Tracker]: Unable to get queue from Sonarr server: Sonarr UHD connect ETIMEDOUT 10.90.3.202:80 The Root Cause Before the migration to Gateway API routes, Overseerr was configured to use external domain names (e.g., sonarr.${SECRET_DOMAIN}, radarr-uhd.${SECRET_DOMAIN}). These domains were resolving via k8s-gateway DNS to the envoy-internal LoadBalancer IP 10.90.3.202.\nThe problem? Pods cannot reach LoadBalancer IPs directly when using Cilium\u0026rsquo;s socketLB in certain configurations. 
While I had enabled socketLB to fix the hairpin NAT issue for external services, it wasn\u0026rsquo;t working for all pod-to-LoadBalancer scenarios.\nTesting revealed:\n✅ Direct cluster DNS (sonarr.downloads.svc.cluster.local:80) - WORKS ✅ ClusterIP access (10.96.251.188:80) - WORKS ❌ LoadBalancer IP (10.90.3.202:80) - TIMES OUT ❌ External domains (resolve to LoadBalancer IP) - FAILS The Solution The fix is simple: use internal cluster DNS names instead of external domains. Since all the *arr services run in the downloads namespace and Overseerr runs in entertainment, I needed to update the Overseerr configuration to use fully-qualified cluster DNS names.\nUpdate Overseerr (Settings → Services) with these internal service URLs:\nSonarr instances:\nSonarr (Default): http://sonarr.downloads.svc.cluster.local port 80 Sonarr Horror: http://sonarr.downloads.svc.cluster.local port 80 Sonarr UHD (Default 4K): http://sonarr-uhd.downloads.svc.cluster.local port 80 Sonarr Foreign: http://sonarr-foreign.downloads.svc.cluster.local port 80 Radarr instances:\nRadarr (Default): http://radarr.downloads.svc.cluster.local port 80 Radarr UHD (Default 4K): http://radarr-uhd.downloads.svc.cluster.local port 80 Plex instances:\nPlex: http://plex.entertainment.svc.cluster.local port 80 Tautulli: http://tautulli.entertainment.svc.cluster.local port 80 All should have SSL set to false since internal cluster traffic doesn\u0026rsquo;t need TLS termination.\nI also had to fix the link to Plex inside Tautulli by setting it to\nPlex: plex.entertainment.svc.cluster.local port 32400 Why This Happened The original configuration worked with ingress-nginx because it ran with hostNetwork: true, meaning pods accessed services through the host\u0026rsquo;s network stack and never hit the LoadBalancer service abstraction. 
Envoy Gateway deploys as regular pods without hostNetwork, so any attempt to reach a LoadBalancer IP from inside the cluster goes through the LoadBalancer service.\nWhile Cilium\u0026rsquo;s socketLB is supposed to handle pod-to-LoadBalancer connectivity, the most reliable pattern is to use cluster DNS for internal service-to-service communication. LoadBalancer IPs should only be used for external ingress traffic.\nLesson learned: After migrating from ingress-nginx to Gateway API, audit all application configurations that reference external domain names for internal services. If both the client and server are in-cluster, use cluster DNS (\u0026lt;service\u0026gt;.\u0026lt;namespace\u0026gt;.svc.cluster.local) instead of external domains that resolve to LoadBalancer IPs.\n","date":"2025-11-21T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/talos-dr-reset/","title":"Rebuilding My Talos Cluster from Bare Metal"},{"content":" \u0026ldquo;TL;DR Tesla don\u0026rsquo;t make this simple, Kubernetes with proper security makes it worse\u0026rdquo;\nTesla Fleet Integration in Kubernetes: The Real-World Guide \u0026ldquo;Sometimes the best smart home setup is the one that actually works — even if it means fighting Kubernetes for three hours.\u0026rdquo;\nIntro I\u0026rsquo;ve been following This Smart House\u0026rsquo;s excellent guide on setting up Tesla Fleet integration with Home Assistant, but let\u0026rsquo;s be honest — it\u0026rsquo;s written for Home Assistant OS with add-ons, not containerized deployments in Kubernetes.\nAfter wrestling with file permissions, mount conflicts, and Tesla\u0026rsquo;s overly complex API requirements, I finally got it working. 
Here\u0026rsquo;s what I learned, including all the gotchas that\u0026rsquo;ll save you hours of frustration.\nFull credit to This Smart House for the original guide — this is essentially a Kubernetes adaptation of their excellent work.\nWhy This Integration Matters Tesla\u0026rsquo;s new Fleet API integration lets you:\nLock/unlock your Tesla from Home Assistant Start climate control remotely Monitor charging status and battery levels Control charging rates Create automations (pre-heat before you leave, charge during solar peak, etc.) All without opening the Tesla app. It\u0026rsquo;s the real deal for smart home automation.\nPrerequisites Before diving in, make sure you have:\nTesla account with at least one vehicle Home Assistant running in Kubernetes (I\u0026rsquo;m using the bjw-s app-template) External domain with proper SSL (Tesla won\u0026rsquo;t work with self-signed certs) ingress-nginx controller (I have not yet moved to Gateway API) External-secrets-operator with 1Password (optional, but recommended) The Kubernetes Challenges The original guide assumes you\u0026rsquo;re using Home Assistant OS add-ons for:\nNGINX SSL Proxy Add-on (for serving public keys) Advanced SSH \u0026amp; Web Terminal (for generating encryption keys) File system access (for placing keys in specific locations) In Kubernetes, we need to handle all of this differently.\nStep 1: Generate Tesla Fleet Encryption Keys Tesla requires ECDSA P-256 encryption keys for secure communication. 
Generate these locally:\n# Generate private key openssl ecparam -name prime256v1 -genkey -noout -out tesla_fleet.key # Generate public key openssl ec -in tesla_fleet.key -pubout -out public-key.pem Store these safely — you\u0026rsquo;ll need both for the setup.\nStep 2: Tesla Developer Portal Setup Go to developer.tesla.com and create an account Create a new application with these settings: OAuth Grant Type: Authorization Code and Machine-to-Machine Allowed Origin URL: https://subdomain.domain.com e.g. hass.domain.com Redirect URI: https://my.home-assistant.io/redirect/oauth Scopes: Select ALL available scopes (Vehicle Information, Vehicle Location, Vehicle Commands, Vehicle Charging Management, Energy Product Information, Energy Product Commands) Critical gotcha: Make sure you specify the subdomain used for Home Assistant; if you select the root domain, this doesn\u0026rsquo;t work.\nStep 3: Serving the Public Key (The Hard Part) Tesla needs to access your public key at https://your-domain.com/.well-known/appspecific/com.tesla.3p.public-key.pem.\nThe Mount Approach (Doesn\u0026rsquo;t Work) My first instinct was to mount the public key file and serve it:\n# DON\u0026#39;T DO THIS - it causes mount conflicts persistence: tesla-keys: type: secret name: home-assistant-secret globalMounts: - path: /.well-known/appspecific/com.tesla.3p.public-key.pem subPath: tesla_public_key readOnly: true This fails spectacularly with \u0026ldquo;not a directory\u0026rdquo; errors because Kubernetes can\u0026rsquo;t create the deep directory structure required.\nThe Working Solution: nginx Ingress Instead, serve the public key content directly via nginx:\ningress: tesla-key: annotations: external-dns.alpha.kubernetes.io/target: external.${SECRET_DOMAIN} nginx.ingress.kubernetes.io/server-snippet: | location = /.well-known/appspecific/com.tesla.3p.public-key.pem { return 200 \u0026#34;-----BEGIN PUBLIC KEY----- 
YOUR_ACTUAL_PUBLIC_KEY_CONTENT_HERE -----END PUBLIC KEY-----\u0026#34;; add_header Content-Type application/x-pem-file; } className: external hosts: - host: hass.${SECRET_DOMAIN} paths: - path: /.well-known/appspecific/com.tesla.3p.public-key.pem pathType: Exact service: identifier: app port: http This bypasses all the file mounting complexity and serves the key directly from nginx.\nStep 4: The Private Key Challenge Home Assistant needs the private key at /config/tesla_fleet.key. With readOnlyRootFilesystem: true, you can\u0026rsquo;t just write files to the container.\nThe Permission Dance Here\u0026rsquo;s the process that actually works:\n1. Temporarily disable read-only filesystem:\n1 2 3 4 5 6 containers: app: securityContext: allowPrivilegeEscalation: false readOnlyRootFilesystem: false # Temporarily disable capabilities: {drop: [\u0026#34;ALL\u0026#34;]} 2. Apply and create the file:\n1 2 3 4 5 kubectl exec -it deployment/home-assistant -c app -- sh -c \u0026#39;cat \u0026gt; /config/tesla_fleet.key \u0026lt;\u0026lt; \u0026#34;EOF\u0026#34; -----BEGIN EC PRIVATE KEY----- YOUR_PRIVATE_KEY_CONTENT_HERE -----END EC PRIVATE KEY----- EOF\u0026#39; 3. Re-enable read-only filesystem:\n1 2 3 4 5 6 containers: app: securityContext: allowPrivilegeEscalation: false readOnlyRootFilesystem: true # Re-enable capabilities: {drop: [\u0026#34;ALL\u0026#34;]} The file persists on the PVC even after re-enabling read-only mode.\nStep 5: Domain Registration with Tesla This step is critical and often overlooked. 
You must register your domain with Tesla\u0026rsquo;s Fleet API before they\u0026rsquo;ll accept your public key.\nGet an access token:\n1 2 3 4 5 6 7 8 curl --request POST \\ --header \u0026#39;Content-Type: application/x-www-form-urlencoded\u0026#39; \\ --data-urlencode \u0026#39;grant_type=client_credentials\u0026#39; \\ --data-urlencode \u0026#39;client_id=YOUR_CLIENT_ID\u0026#39; \\ --data-urlencode \u0026#39;client_secret=YOUR_CLIENT_SECRET\u0026#39; \\ --data-urlencode \u0026#39;scope=openid vehicle_device_data vehicle_cmds vehicle_location vehicle_charging_cmds energy_device_data energy_cmds\u0026#39; \\ --data-urlencode \u0026#39;audience=https://fleet-api.prd.na.vn.cloud.tesla.com\u0026#39; \\ \u0026#39;https://fleet-auth.prd.vn.cloud.tesla.com/oauth2/v3/token\u0026#39; Register your domain:\n1 2 3 4 curl --location \u0026#39;https://fleet-api.prd.na.vn.cloud.tesla.com/api/1/partner_accounts\u0026#39; \\ --header \u0026#39;Content-Type: application/json\u0026#39; \\ --header \u0026#39;Authorization: Bearer YOUR_ACCESS_TOKEN\u0026#39; \\ --data \u0026#39;{\u0026#34;domain\u0026#34;: \u0026#34;your-hass-domain.com\u0026#34;}\u0026#39; Include ALL scopes in the token request — missing scopes here will limit what the Home Assistant integration can do later.\nStep 6: File Permission Gotchas If you run into permission issues (and you probably will), here\u0026rsquo;s the fix that actually works:\nUpdate your pod security context:\n1 2 3 4 5 6 7 8 defaultPodOptions: securityContext: runAsNonRoot: true runAsUser: 568 runAsGroup: 568 fsGroup: 568 fsGroupChangePolicy: Always # This is critical seccompProfile: {type: RuntimeDefault} The key insight: fsGroupChangePolicy: Always ensures all files created in mounted volumes get proper ownership, preventing the auth file permission errors that can break Home Assistant startup.\nStep 7: Home Assistant Integration Once everything is set up:\nAdd the Tesla Fleet integration in Home Assistant Enter your Client ID and Secret 
Set private key path to: /config/tesla_fleet.key Complete the OAuth flow If you\u0026rsquo;ve done everything correctly, you\u0026rsquo;ll get all scopes including Vehicle Commands, and you can finally control your Tesla from your smart home.\nThe Complete Working Configuration Here\u0026rsquo;s my final working configuration:\nAfter (working) — serving public key via nginx, manual private key creation\nKey Lessons Learned Tesla API Gotchas:\nThe Home Assistant subdomain must be used, not the root domain Domain registration is mandatory before public key acceptance All scopes must be included in the domain registration token Public key format is extremely picky about line breaks Kubernetes Gotchas:\nCan\u0026rsquo;t mount files to deep directory paths (/.well-known/appspecific/...) readOnlyRootFilesystem prevents file creation even in mounted volumes File permissions get messy when switching between privileged/unprivileged containers fsGroupChangePolicy: Always is essential for proper file ownership Home Assistant Gotchas:\nThe integration requires the private key file path, not the content OAuth scopes are determined by your Tesla Developer app, not the HA integration Deleting and re-adding the integration is often necessary for scope updates The Stupid Simple Alternative If you\u0026rsquo;re reading this and thinking \u0026ldquo;this is way too complex,\u0026rdquo; you\u0026rsquo;re right. Most people should probably just:\nUse Home Assistant OS Install the NGINX add-on Follow the original guide But if you\u0026rsquo;re committed to running everything in Kubernetes (like I am), this is how you make it work.\nThe Result Final Thoughts Tesla\u0026rsquo;s Fleet API setup is unnecessarily convoluted for what should be a straightforward integration. The combination of encryption keys, domain registration, public key serving, and OAuth flows feels like security theater more than actual security.\nThat said, once it\u0026rsquo;s working, the integration is solid. 
Being able to pre-condition your Tesla from a Home Assistant automation, or automatically adjust charging rates based on solar production, makes the setup pain worth it.\nFull credit again to This Smart House for the original comprehensive guide. Their work made this Kubernetes adaptation possible.\nNow if you\u0026rsquo;ll excuse me, I\u0026rsquo;m going to go start my car\u0026rsquo;s climate control from my smartwatch like the proper nerd I am.\nYou can find my complete home-ops configuration on GitHub. Feel free to steal whatever works for your setup.\n","date":"2025-08-08T00:00:00+12:00","permalink":"https://blog.nerdz.cloud/2025/tesla-fleet-hass/","title":"Tesla Integration with Home Assistant"},{"content":" \u0026ldquo;High availability for LLMs doesn\u0026rsquo;t need to be hard. Just give them a shared brain and a quiet place to think.\u0026rdquo;\nIntro This is the continuation of my open source LLM deployment journey. In Part 2, I got Open-WebUI up and running with a connection to OpenAI and some solid OIDC-based auth.\nNow it\u0026rsquo;s time to actually host my own models — enter Ollama.\nWhy Ollama? Ollama is a great backend for running open-source models like Mistral, DeepSeek Coder, TinyLLaMA, and many others. It offers:\nA clean CLI and API OpenAI-compatible endpoints Easy Docker-based deployment Excellent model ecosystem \u0026hellip;but it also comes with a few quirks — especially when trying to run it in Kubernetes. Let’s dive into how I set it up and what I learned.\nGoals Deploy Ollama in Kubernetes Support High Availability across 3 nodes Share model and config volumes across pods Preload models to avoid UI pulls (at first) Integrate with Open-WebUI Why I chose a DaemonSet Most guides suggest deploying Ollama as a single Deployment, but I had other plans. 
I wanted Ollama to:\nBe available on each node Share model storage (so models don\u0026rsquo;t redownload per pod) Provide resilience — if one node goes down, the others keep serving For that reason, I rolled it out as a DaemonSet, with a shared RWX volume for models and config. That means each node hosts its own Ollama pod, but they all read/write to the same /models and /root/.ollama paths.\nThis was critical because:\nDownloads happen once Config/state persists Open-WebUI can talk to any node Storage Setup I used a CephFS-backed PVC with ReadWriteMany access mode to share data across pods. Here\u0026rsquo;s the relevant snippet:\npersistence: models: enabled: true existingClaim: ollama-models-shared advancedMounts: ollama: app: - path: /models config: enabled: true existingClaim: ollama advancedMounts: ollama: app: - path: /root/.ollama If you\u0026rsquo;re running the bjw-s app-template Helm chart, that’s how you wire volumes into the containers cleanly.\nWhat about model preloading? I initially tried to preload models using an initContainer, but Ollama’s CLI expects a running server in the background for ollama pull to work. That meant the init container crashed trying to pull models before the server was up.\nI then considered a Kubernetes Job that spun up a temporary ollama serve, pulled the models, and exited, but in the end I decided to keep it simple.\nPulling Models via UI Instead of declarative model pulls, I now:\nBrowse to your Open-WebUI URL\nLog in as admin\nSelect your avatar in the top right and click Admin Panel\nIn the top left menu click Settings\nThen in the left menu click Models\nHere you will see all your OpenAI/ChatGPT models if you followed that step from my previous post\nClick on the download icon in the top right\nNext, look at Ollama’s supported models. 
They can be found here.\nSearch for each model that you want, click on it, then open the Tags drop-down and click View All. Make a note of the model and tag you want: tinyllama:1.1b mistral:7b-instruct-q4_0 deepseek-coder:6.7b Click the download icon next to where you pasted the model:tag\nYou will see the model begin to download and will see pop-ups in the top right about it verifying the SHA\nAfter that is complete, refresh the admin page, click Models in the left menu again and scroll down through the list to see your new model\nRepeat this process for your other models\nLet’s verify: from your terminal, run the following command, which will grab the first Ollama container and report the models:\nkubectl -n cortex exec -it $(kubectl -n cortex get pods -l app.kubernetes.io/name=ollama -o jsonpath='{.items[0].metadata.name}') -- ollama list\nRemember, if you want to change your default model, click the cog next to the download icon you clicked earlier. Here you can reorder the model list and choose your default.\nNote Because all my Ollama pods share the same /models volume, once a model is downloaded, it becomes instantly available to all nodes.\nGotchas You can’t use ollama pull unless the server is running. Init containers don’t work unless you do a background process trick. Use a DaemonSet only if you have shared storage. Otherwise you’ll redownload the same models three times. Open-WebUI doesn’t preload models declaratively. Use the admin UI. Text Only These models are not able to read images and interpret text. My Current State You can see the full setup in my home-ops repo here:\nOllama HelmRelease Open-WebUI HelmRelease The result: I now have a high-availability Ollama setup across 3 nodes, talking to a clean OIDC-protected Open-WebUI front-end, and serving both OpenAI models and my own local ones — no cloud dependencies required.\nNext Steps\nI\u0026rsquo;m watching PR #6729 on the Ollama repo closely. 
When it lands, I plan to test distributed inference across all nodes — allowing parallel execution and true multi-node load balancing.\nI\u0026rsquo;m also keeping an eye on vLLM, which is designed for blazing-fast inference and serves models with lower latency, dynamic batching, and high throughput — ideal for production-level performance. It’s more complex to integrate and resource-hungry, but I’m planning to explore it for my homelab in a future post.\nStay tuned for Part 4 👀\n","date":"2025-04-10T00:00:00+12:00","permalink":"https://blog.nerdz.cloud/2025/deploying-open-llms-03/","title":"Deploying Open Source LLMs in a Homelab - Part 3"},{"content":" \u0026ldquo;“Ideas are cheap. The real magic happens when those ideas survive YAML, GitOps, and Grafana dashboards.”\u0026rdquo;\nIntro In my last post, I talked about my intent. In this post I will document what I am actually doing (and seeing if intent matches reality)\nFirst up is understanding Open-WebUI\nOpen Web UI Website | Github | Documentation\nOpen WebUI lets you run and talk to AI models locally from your browser — no internet or cloud required. It connects to model backends like Ollama or anything OpenAI-compatible, and comes with advanced features like smart document search (RAG) built-in.\nThis will be our front end (website) that lets us interact with both local models (hosted in our k8s cluster) and remote models (like ChatGPT). This is the ideal place to start before rolling our your own models.\nWhat is RAG? 
Retrieval-Augmented Generation (RAG) is a way of helping AI models give better answers by letting them search through documents or notes you provide — like giving the AI a memory or reference book it can read from before responding.\nReading the documentation and looking for ENV Vars One of the first things I want to do is read through the documentation, check that the default values are set the way I want them, and pull out the ones that are not so I can change them in my deployment.\nValues I need to set and settings I need to change Some of these I will set in the helmrelease.yaml and others in the externalsecrets.yaml. Conventionally you would only store secrets in the external secret file, but you can also store other ENV VARS there if you don\u0026rsquo;t want to bloat your Helm release too much.\nexternalsecrets.yaml\n# Open-WebUI Config OPENAI_API_KEY: \u0026#34;{{ .OPENAI_API_KEY }}\u0026#34; ADMIN_EMAIL: \u0026#34;{{ .OPENAI_ADMIN_EMAIL }}\u0026#34; ENABLE_ADMIN_CHAT_ACCESS: \u0026#34;true\u0026#34; ENABLE_ADMIN_EXPORT: \u0026#34;true\u0026#34; DEFAULT_USER_ROLE: \u0026#34;user\u0026#34; DATABASE_URL: \u0026#34;postgres://{{ .OPENWEBUI_DB_USER }}:{{ .OPENWEBUI_DB_PASSWORD }}@postgres17-rw.database.svc.cluster.local:5432/openwebui?sslmode=disable\u0026#34; WEBUI_SECRET_KEY: \u0026#34;{{ .WEBUI_SECRET_KEY }}\u0026#34; # Pocket ID Config OAUTH_PROVIDER_NAME: pocketid OAUTH_CLIENT_ID: \u0026#34;{{ .OPENWEBUI_POCKETID_CLIENTID }}\u0026#34; OAUTH_CLIENT_SECRET: \u0026#34;{{ .OPENWEBUI_POCKETID_SECRET }}\u0026#34; OPENID_PROVIDER_URL: \u0026#34;{{ .OPENWEBUI_POCKETID_DISCOVERY }}\u0026#34; OPENID_REDIRECT_URI: \u0026#34;{{ .OPENWEBUI_POCKETID_REDIRECT }}\u0026#34; OAUTH_SCOPES: openid profile email helmrelease.yaml\nGLOBAL_LOG_LEVEL: \u0026#34;DEBUG\u0026#34; ENABLE_LOGIN_FORM: \u0026#34;false\u0026#34; OAUTH_MERGE_ACCOUNTS_BY_EMAIL: true ENABLE_OPENAI_API: 
\u0026#34;true\u0026#34; ENABLE_OAUTH_SIGNUP: \u0026#34;true\u0026#34; ENABLE_WEBSOCKET_SUPPORT: \u0026#34;true\u0026#34; WEBSOCKET_MANAGER: \u0026#34;redis\u0026#34; WEBSOCKET_REDIS_URL: \u0026#34;redis://dragonfly.database.svc.cluster.local:6379\u0026#34; ENABLE_RAG_WEB_SEARCH: true RAG_WEB_SEARCH_ENGINE: searxng SEARXNG_QUERY_URL: http://searxng.services.svc.cluster.local:8080/search?q=\u0026lt;query\u0026gt; Open-WebUI ENV Var Notes ENABLE_LOGIN_FORM Type: bool Default: True Description: Toggles email, password, sign in and \u0026ldquo;or\u0026rdquo; (only when ENABLE_OAUTH_SIGNUP is set to True) elements. Persistence: This environment variable is a PersistentConfig variable. ⚠️ DANGER\nThis should only ever be set to False when ENABLE_OAUTH_SIGNUP is also being used and set to True.\nFailure to do so will result in the inability to login.\nENABLE_OAUTH_SIGNUP Type: bool Default: False Description: Enables account creation when signing up via OAuth. Distinct from ENABLE_SIGNUP. Persistence: This environment variable is a PersistentConfig variable. ⚠️ DANGER\nENABLE_LOGIN_FORM must be set to False when ENABLE_OAUTH_SIGNUP is set to True. 
Failure to do so will result in the inability to login.\nRAG_WEB_SEARCH_ENGINE Type: str (enum) 🔍 RAG_WEB_SEARCH_ENGINE Options: Comparison Table Engine Description Pros Cons searxng Uses the SearXNG engine ✅ Self-hostable, privacy-friendly, highly customizable ❌ May require setup and maintenance google_pse Google Programmable Search Engine ✅ Accurate, well-indexed, powerful relevance ranking ❌ API limits, requires API key brave Brave Search ✅ Independent index, private, fast ❌ May lack depth compared to Google kagi Kagi Search ✅ Human-curated results, privacy-respecting ❌ Paid subscription required for full access mojeek Mojeek ✅ Independent crawler, no tracking ❌ Results less relevant for niche topics serpstack Serpstack ✅ Easy API for Google results ❌ Commercial service, requires API key serper Serper ✅ Google-like output, simple API ❌ API limits, free tier capped serply Serply ✅ Tailored for AI + LLM use cases ❌ Smaller user base, may have reliability issues SerpApi SerpApi ✅ search API that supports many engines with quick response times Setup done via ui. See here duckduckgo DuckDuckGo ✅ Privacy-first, no tracking ❌ No real API (scraped or proxied, limited metadata) tavily Tavily ✅ AI-tuned search for RAG, fast ❌ Still new, smaller index jina Jina AI ✅ Vector-aware search options ❌ Focused more on enterprise \u0026amp; vector DBs bing Microsoft Bing search engine ✅ Wide coverage, high-quality results ❌ Requires API key, tracking concerns Note I will be using Searxng which will require me to deploy that BEFORE I can proceed.\nDeploying SearXNG As is tradition, I will be walking on the shoulders of giants and taking advantage of kubesearch.dev, an amazing website that:\nSearch Flux HelmReleases through awesome k8s-at-home projects, check it out at https://kubesearch.dev/. We index Flux HelmReleases from Github and Gitlab repositories with the k8s-at-home topic and kubesearch topic. 
To include your repository in this search it must be public and then add the topic k8s-at-home or kubesearch to your GitHub Repository topics.\nMy deployment of SearXNG can be found in my home-ops repo on GitHub\nThere were a couple of interesting learnings from this deployment\nIn the settings.yaml file I wanted regionalised searches: results for different countries on demand, with NZ results by default.\nHere are the things I did:\nsearch: autocomplete: google favicon_resolver: duckduckgo default_lang: en-NZ languages: - en-AU - en-CA - en-GB - en-NZ - en-US ui: ... default_locale: en engines: - name: google engine: google shortcut: g parameters: - hl: en - gl: nz # This tells Google: \u0026#34;give me NZ-localized results\u0026#34; - cr: countryNZ - tbs: ctr:countryNZ This allows me to (by default) get localised NZ searches and then just change the language drop-down to switch to Canadian, United Kingdom, United States or Australian searches\nDeploying Open-WebUI This was a wild ride. Here are the things I wanted to achieve initially.\nOpen-WebUI deployed in a basic fashion Connected to my paid OpenAI ChatGPT account Login handled by PocketID OIDC Sharing of OpenAI Model across users in my instance of Open-WebUI There was some finagling and misinterpreting of environment variables (there are so many), but I got there in the end. You can see my initial (working) deployment here and my current state here\nGaining access to OpenAI (ChatGPT models from a free or paid account) Browse to https://openai.com/ and click Log In followed by API Platform If this is your first time here, you will likely need to set an Organization name. I chose to call mine after my cluster Once logged in, in the left menu, click API keys and in the top right click Create new secret key Give the secret a name e.g. 
Open-WebUI and assign it to a project (if you have not set any up, then default is fine) Click Create Secret key and copy the value that shows up and store it in your secrets manager under the value OPENAI_API_KEY (See my externalsecrets.yaml example below) Deployment Learnings helmrelease.yaml\n1 2 3 4 5 6 7 8 9 10 11 12 env: GLOBAL_LOG_LEVEL: \u0026#34;DEBUG\u0026#34; ENABLE_LOGIN_FORM: \u0026#34;false\u0026#34; OAUTH_MERGE_ACCOUNTS_BY_EMAIL: true ENABLE_OPENAI_API: \u0026#34;true\u0026#34; ENABLE_OAUTH_SIGNUP: \u0026#34;true\u0026#34; ENABLE_WEBSOCKET_SUPPORT: \u0026#34;true\u0026#34; WEBSOCKET_MANAGER: \u0026#34;redis\u0026#34; WEBSOCKET_REDIS_URL: \u0026#34;redis://dragonfly.database.svc.cluster.local:6379\u0026#34; ENABLE_RAG_WEB_SEARCH: true RAG_WEB_SEARCH_ENGINE: searxng SEARXNG_QUERY_URL: http://searxng.services.svc.cluster.local:8080/search?q=\u0026lt;query\u0026gt; Make sure you set the Log Level to Debug, it makes deployment and troubleshooting much easier 🤣 ENABLE_LOGIN_FORM: \u0026quot;false\u0026quot; This need to be false if you are using OIDC ENABLE_OAUTH_SIGNUP: \u0026quot;true\u0026quot; If you don\u0026rsquo;t have this set, then your OIDC provider (PocketID in my case), can\u0026rsquo;t create an account inside Open-WebUI externalsecret.yaml\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # Open-WebUI Config OPENAI_API_KEY: \u0026#34;{{ .OPENAI_API_KEY }}\u0026#34; ADMIN_EMAIL: \u0026#34;{{ .OPENAI_ADMIN_EMAIL }}\u0026#34; ENABLE_ADMIN_CHAT_ACCESS: \u0026#34;true\u0026#34; ENABLE_ADMIN_EXPORT: \u0026#34;true\u0026#34; DEFAULT_USER_ROLE: \u0026#34;user\u0026#34; DATABASE_URL: \u0026#34;postgres://{{ .OPENWEBUI_DB_USER }}:{{ .OPENWEBUI_DB_PASSWORD }}@postgres17-rw.database.svc.cluster.local:5432/openwebui?sslmode=disable\u0026#34; WEBUI_SECRET_KEY: \u0026#34;{{ .WEBUI_SECRET_KEY }}\u0026#34; # Pocket ID Config OAUTH_PROVIDER_NAME: pocketid OAUTH_CLIENT_ID: \u0026#34;{{ .OPENWEBUI_POCKETID_CLIENTID }}\u0026#34; OAUTH_CLIENT_SECRET: \u0026#34;{{ 
.OPENWEBUI_POCKETID_SECRET }}\u0026#34; OPENID_PROVIDER_URL: \u0026#34;{{ .OPENWEBUI_POCKETID_DISCOVERY }}\u0026#34; OPENID_REDIRECT_URI: \u0026#34;{{ .OPENWEBUI_POCKETID_REDIRECT }}\u0026#34; OAUTH_SCOPES: openid profile email PROVIDER_URL and DISCOVERY-URL are the same damn thing, but different tools call them different things. This should be set to: https://{your OIDC url}/.well-known/openid-configuration OPENID_REDIRECT_URI Make sure that the path for this is: https://{Open-WebUI URL}/oauth/oidc/callback. This needs to be set BOTH in your OIDC config AND your ENV Var in externalsecrets Configuration learnings Setting up Groups If you plan to have more than one user then you should probably set up groups.\nEnsure the other users have logged in via OIDC at least once to have their accounts created Navigate to https://{Open-WebUI URL}/admin/users and click on Groups. Click the Plus in the top right to create a new group and give it a name (and a description if needed) and click Create Viewing your new group Click the ✏️ pencil in the top right to edit it Click Permissions and review them; the defaults are likely fine but you may want to make some adjustments Click on Users and check the box next to each user you want to add to the group Allowing model access If, like me, you configured OPENAI_API_KEY in your externalsecret then you will have access to ALL the OpenAI (ChatGPT) models that your plan allows\u0026hellip; and there are a lot. If you did not (and want to) you will need to go through the process of generating an API key and adding it to your externalsecret.yaml, as described above\nNavigate to your admin settings https://{Open-WebUI URL}/admin/settings In the left menu click on Models Here you will see a massive list; feel free to disable as many of these as you see fit. 
I only retained the following: gpt-3.5-turbo gpt-4 gpt-4-turbo gpt-4o gpt-4o-mini Once you have your list, for each one click on the ✏️ pencil Under Visibility click Select a group and select the group you created earlier. Click Save \u0026amp; Update Your other users now have access to use that model Repeat this process for the other models. Chat History If you are wanting your chat history from ChatGPT you will need to find a way import it directly into the Open-WebUI Database, There is no sync function between Open-WebUI and chat.openai.com\nNext Steps Next steps will be looking to deploy my own models locally so that long term I have no reliance on paid external tools like OpenAI\u0026rsquo;s ChatGPT\n","date":"2025-04-07T00:00:00+12:00","permalink":"https://blog.nerdz.cloud/2025/deploying-open-llms-02/","title":"Deploying Open Source LLMs in a Homelab - Part 2"},{"content":" \u0026ldquo;The real question is not whether machines think, but whether humans do.\u0026rdquo;\n— B.F. Skinner\nIntro Unless you’ve been living under a pile of failed kubectl commands, you’ve probably noticed AI is everywhere. From image generators that make photorealistic art in seconds, to chatbots that can walk you through debugging your Docker Compose file, AI is embedded in daily life now.\nBut here’s the thing — not all AI is created equal.\nThere’s a ton of jargon flying around: LLMs, ML, AI, Generative AI, Transformers, Diffusion Models\u0026hellip; and people use them interchangeably (wrongly, I might add). So let’s break this all down, and throw in some opinionated takes on open vs closed source models while we’re at it.\nThe Big Three Here’s a super simplified breakdown:\n1. Artificial Intelligence (AI) This is the umbrella term. Anything that simulates human intelligence — planning, reasoning, problem-solving — falls under this.\n2. Machine Learning (ML) ML is a subset of AI. 
It focuses on training algorithms with data so that they can learn and make predictions or decisions without being explicitly programmed to do so.\nThink:\nSorting your spam emails Recommending you another \u0026ldquo;Linux ISO\u0026rdquo; to download (yeah, sure 😏) 3. Generative AI This is a subset of ML. It’s trained to create new content — text, images, audio, video — based on learned patterns. It’s what powers:\nChatGPT (text) Midjourney, Stable Diffusion (images) Suno, MusicGen (audio) So where do LLMs come into this?\nLLMs – Large Language Models These are the generative AI brainiacs that focus specifically on text.\nThey\u0026rsquo;re the ones reading docs, summarizing PDFs, hallucinating package install instructions, or writing spicy Kubernetes blogs for you.\nLLMs are trained on massive amounts of text data and rely heavily on a technique called Transformers, which is what revolutionized the AI world circa 2017.\nPopular LLMs you’ve probably heard of:\nGPT-4 (OpenAI, closed source) Claude (Anthropic, closed source) Mistral (Open source, 🔥 fast rising star) LLaMA (Meta, open-ish source — depends who you ask) Closed Source vs Open Source Models This is where things get juicy.\n🧱 Closed Source AI (GPT, Claude, Gemini, etc.) These are typically the domain of Big Tech. They don\u0026rsquo;t share their training data, weights, or inner workings. You’re locked into their ecosystem and pricing models.\nPros:\nGenerally higher accuracy (as of now) Huge funding = more compute = more training Easier to access if you\u0026rsquo;re non-technical (nice UIs, APIs, etc) Cons:\nNo insight into what\u0026rsquo;s under the hood Limited customization Expensive at scale Not privacy-friendly Note: Many of the closed source Generative AI tools do release their models for use on Ollama\n🔓 Open Source AI (Mistral, LLaMA, TinyLLaMA, Mixtral, etc.) 
These models release their weights, training data (sometimes), and usually run great on your own hardware or cluster.\nPros:\nComplete control \u0026amp; transparency Host it yourself (goodbye API limits!) Customize it to your needs (e.g., domain-specific fine-tuning) Huge OSS community — collaboration is fast-paced Cons:\nRequires more technical know-how May lag slightly behind in benchmark scores Inference can be slower without proper infra (aka don’t expect 7B models to run well on a Pi) How They Interact This is something a lot of folks miss: these aren\u0026rsquo;t siloed systems. Here\u0026rsquo;s how the puzzle fits together:\nML is the foundation. It\u0026rsquo;s how all these models are trained. LLMs are a type of ML model, focused on language. They use ML principles. Generative AI is a purpose: to generate — and LLMs fall under this when they generate text. You interact with Generative AI (via UI, chat, API) → the underlying LLM runs inference → built on top of ML training → likely trained using massive GPU farms, a few metric tons of Reddit data, and questionable StackOverflow posts. My Perspective Right now, I’m in the process of prepping my environment to run open source LLMs (think: Mistral, TinyLLaMA, and Code LLaMA) directly on my own infrastructure.\nThe goal?\nLocal, private, fast models No reliance on cloud APIs Fully integrated with my Kubernetes setup I’ll be writing about that setup (and the fun/chaos of queue-based autoscaling, Grafana monitoring, and model orchestration) in an upcoming post.\nFor now, just know: open source is not only viable — it\u0026rsquo;s starting to lead the charge.\nTL;DR AI is the umbrella. ML teaches AI how to behave (The building blocks of Generative AI). Generative AI creates content. LLMs are language specialists in the generative AI world. Open source is rising fast and worth your attention. My next post will cover how I run LLMs on-demand in a homelab Kubernetes cluster. 
Until then, keep your pods healthy, your YAML clean, and your tokens per second high.\n","date":"2025-04-06T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/understanding-ai/","title":"Understanding AI: Generative AI, LLMs, ML \u0026 the Open vs Closed Source Debate"},{"content":" \u0026ldquo;A lesson learned and not remembered is a mistake waiting to happen again.\u0026rdquo;\n— Unknown\nAll activity halted I awoke Sunday morning to see that all my Linux ISOs had stopped downloading. All stalled. I kicked my container, but alas, no change\nChecking my logs I could see that my wg0 (Wireguard) interface was in an Unknown state and that I had no connection to the outside world.\nMy Setup I run a pod with 3 containers in it\nQBitorrent - cross-platform free and open-source BitTorrent client written in native C++. Glueten - VPN client in a thin Docker container for multiple VPN providers, written in Go, and using OpenVPN or Wireguard, DNS over TLS, with a few proxy servers built-in. gluetun-qb-port-sync - As its written on the tin, sync the ports between gluten and qbitorrent This is coupled with ProtonVPN (VPN Plus).\nThe errors 1 2 3 4 5 6 7 8 9 10 11 12 gluetun 2025-02-02T19:28:42Z INFO [wireguard] Wireguard setup is complete. Note Wireguard is a silent protocol and it may or may not work, without giving any error message. Typically i/o timeout errors indicate the Wireguard connection is not working. 
gluetun 2025-02-02T19:28:55Z INFO [healthcheck] program has been unhealthy for 11s: restarting VPN (healthcheck error: dialing: dial tcp4: lookup cloudflare.com: i/o timeout) luetun 2025-02-02T19:28:55Z INFO [healthcheck] ≡ƒæë See https://github.com/qdm12/gluetun-wiki/blob/main/faq/healthcheck.md gluetun 2025-02-02T19:28:55Z INFO [healthcheck] DO NOT OPEN AN ISSUE UNLESS YOU READ AND TRIED EACH POSSIBLE SOLUTION luetun 2025-02-02T19:28:55Z INFO [vpn] stopping gluetun 2025-02-02T19:28:55Z ERROR [vpn] waiting for DNS to be ready: context canceled gluetun 2025-02-02T19:28:55Z ERROR [vpn] getting public IP address information: context canceled gluetun 2025-02-02T19:28:55Z INFO [port forwarding] starting gluetun 2025-02-02T19:28:55Z ERROR [vpn] starting port forwarding service: getting VPN assigned IP address: network interface wg0 not found: route ip+net: no such network interface gluetun 2025-02-02T19:28:55Z INFO [vpn] starting gluetun 2025-02-02T19:28:55Z INFO [firewall] allowing VPN connection... gluetun 2025-02-02T19:28:55Z INFO [wireguard] Using available kernelspace implementation Being a networking n00b, I could not figure out what the issue was. My Subscription was still valid, nothing else had changed\u0026hellip;\nCulprit Found After posting a support ticket in the Home-Operations Discord, a very helpful community member suggested that I recreate my wireguard config file and apply it.\nI log into ProtonVPN Downloads page to generate the wireguard config (you have to scroll down below OpenVPN) and I see this:\nMy Credentials had expired\u0026hellip;\nThe Fix Generate a new config file using the settings below. 
It should automatically pickup the best server for you and show it on step4\nDownload the config file and open it in your favorite text editor\nInside you will have four values you will need\nEndpoint PublicKey PrivateKey Address Assuming you are using the same setup as I am\nYou will need to update 1Password with the new values\nEndpoint = QBITTORRENT_VPN_ENDPOINT_IP PublicKey = QBITTORRENT_WIREGUARD_PUBLIC_KEY PrivateKey = QBITTORRENT_WIREGUARD_PRIVATE_KEY Address = QBITTORRENT_WIREGUARD_ADDRESSES Once this is done, you can run the following command to annotate the secret file so it updates with the changes (change the namespace and secret name accordingly)\n1 kubectl --namespace downloads annotate externalsecret qbittorrent force-sync=$(date +%s) --overwrite Then you can confirm the secret has changed using\n1 kubectl get secret -n downloads qbittorrent-secret -o json | jq -r \u0026#39;.data | to_entries[] | \u0026#34;\\(.key): \\(.value | @base64d)\u0026#34;\u0026#39; Once that is done, your Qbittorrent container should terminate and spin up a fresh container. Give it a few minutes and downloading should resume.\n","date":"2025-02-03T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/qbitorrent-woes/","title":"QBitorrent Woes"},{"content":" \u0026ldquo;The most powerful database is the one you don’t have to recover.\u0026rdquo;\n— Unknown\nRecovery gone bad One of the nice things about how I have my Kubernetes cluster configured is that I use Volsync paired with backblaze B2 and Cloudflare R2 to backup (and recover) all my PVCs for my containers. This means that When I need to totally blow away a container because something has gone wrong, volsync will reach out to my replicationDestination (Backblaze) and pull the latest backup down to build out a new PVC locally.\nRecently, I had to shutdown my entire cluster due to my local power company needing to install a new \u0026ldquo;Smart Metre\u0026rdquo;. 
Under normal circumstances, when the cluster stood back up, Volsync would reach out to Backblaze and rebuild all the PVCs. And this worked almost flawlessly.\nOnly one Postgres node was able to recover; the other two got the following error.\nfile name too long for tar format This seems to have been caused by a bug.\nI was not running standard Cloudnative Postgres. In the past, I had run an application called Immich, which is both amazing and awful at the same time.\nAmazing because it did everything I wanted a photo application to do, awful because nearly every update was a breaking change.\nEventually I got rid of Immich. However, Immich needed an extension called pgvecto.rs in Postgres in order for it to work correctly. To make my life easier at the time, I ran a custom image of Cloudnative Postgres from tensorchord with pgvecto.rs baked in.\nDue to the aforementioned bug, I was not able to restore because the file names for some of the backed up files were too long, and so I was only running on a single node.\nFailed attempts to recover So what did I try to resolve the issue?\nDeleted the Immich DB and confirmed no other databases used the pgvecto.rs extension Forced a new backup to recover from Migrated to vanilla Cloudnative-PG Forced a new backup to recover from None of these things worked. 
After talking to the amazing community in the Home Operations Discord there was only one path forward\nDatabase Migration High Level Process I was going to need to run a brand new Cloudnative-PG v17.2 cluster side by side with my existing Cloudnative-PG v16.3-7 cluster and import my existing databases contents to the new database.\nThankfully, this is much easier than I first thought.\nBaked in support for Recovery and existing clusters I was already using Cloudnative-PGs recovery process as a way to recover when I had DB issues\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 backup: retentionPolicy: 30d barmanObjectStore: \u0026amp;barmanObjectStore data: compression: bzip2 wal: compression: bzip2 maxParallel: 8 destinationPath: s3://nerdz-cloudnative-pg/ endpointURL: https://s3.us-east-005.backblazeb2.com serverName: \u0026amp;currentCluster postgres16-v3 s3Credentials: accessKeyId: name: cloudnative-pg-secret key: aws-access-key-id secretAccessKey: name: cloudnative-pg-secret key: aws-secret-access-key # Note: externalClusters is needed when recovering from an existing cnpg cluster bootstrap: recovery: source: \u0026amp;previousCluster postgres16-v1 # Note: externalClusters is needed when recovering from an existing cnpg cluster externalClusters: - name: *previousCluster barmanObjectStore: \u0026lt;\u0026lt;: *barmanObjectStore serverName: *previousCluster This block of code is what I use to recover my Database under normal circumstances. 
Just give the currentCluster a new number and specify the number for the old cluster and viola, recovered.\nBut in this case, I was going to have to do something a little different\nHow to recover from an existing (external) cluster In the new cluster files change the Bootstrap and externalClusters to look like this\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 bootstrap: # recovery: # source: \u0026amp;previousCluster postgres17v1 initdb: import: type: monolith databases: [\u0026#34;*\u0026#34;] roles: [\u0026#34;*\u0026#34;] source: externalCluster: cnpg16 # Note: externalClusters is needed when recovering from an existing cnpg cluster externalClusters: # - name: *previousCluster # barmanObjectStore: # \u0026lt;\u0026lt;: *barmanObjectStore # serverName: *previousCluster - name: cnpg16 connectionParameters: host: postgres16-rw.database.svc.cluster.local user: postgres dbname: postgres password: name: cloudnative-pg-secret key: password You will see here that the host for the initDb is my existing Cloudnative-PG clusters service\npostgres16-rw.database.svc.cluster.local this translate to serviceName.NameSpace.svc.cluster.local\nYou can see my files on Github\nAfter pushing the changes you will see an import pod turn up which will pull all the data over and then the Cloudnative-PG Operator will spin up a pod with Cloudnative-PG v17 running\nIn my case, only a single pod, as I had scaled down Cloudnative-PG v16 to a single pod to remove all the error from the restore, and I was lasy and copy pasted all the old files when creating the new cluster and missed this detail when making my changes.\nScaling up the deployment to 3 1 2 3 4 5 6 7 # yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/postgresql.cnpg.io/cluster_v1.json apiVersion: postgresql.cnpg.io/v1 kind: Cluster metadata: name: postgres17 spec: instances: 3 Set the instances to 3 and push the change\nusing k9s I can watch the process unfold. 
First you will see a pod named postgres17-2-join-garbageString while the new instance joins the cluster, after which you will see the postgres17-2-garbageString pod running nicely. Eventually, you will see something like this\nNext Steps Migrating apps to the new cluster Take a look at the services in your database namespace.\nkubectl get svc -n database | grep postgres\nand you will see something like this\npostgres-lb LoadBalancer 10.96.91.59 10.90.3.203 5432:32588/TCP 11d\npostgres16-r ClusterIP 10.96.150.103 \u0026lt;none\u0026gt; 5432/TCP 11d\npostgres16-ro ClusterIP 10.96.174.223 \u0026lt;none\u0026gt; 5432/TCP 11d\npostgres16-rw ClusterIP 10.96.134.59 \u0026lt;none\u0026gt; 5432/TCP 11d\npostgres17-lb LoadBalancer 10.96.222.51 10.90.3.210 5432:32310/TCP 17m\npostgres17-r ClusterIP 10.96.108.178 \u0026lt;none\u0026gt; 5432/TCP 17m\npostgres17-ro ClusterIP 10.96.6.175 \u0026lt;none\u0026gt; 5432/TCP 17m\npostgres17-rw ClusterIP 10.96.156.130 \u0026lt;none\u0026gt; 5432/TCP 17m\nHere you can see that we have matching sets of postgres services, with the old one at 10.90.3.203 and the new one at 10.90.3.210 (your IPs will vary based on your cluster)\nMoving apps to the new cluster I use VS Code for interacting with the code for my cluster. All I need to do is search for uses of postgres16-rw.database.svc.cluster.local to find all the apps using Postgres.\nFrom here I will pick a single app, change it over to postgres17-rw.database.svc.cluster.local, push the change and confirm it worked.\nI will use Sonarr for my test.\nWhen I log into Sonarr and go to Status (https://yourSonarURL/system/status), I can see in the About section that it\u0026rsquo;s currently connected to PostgreSQL 16.3\nI need to edit my externalSecret to move the host for Postgres to the new cluster. 
Because I used all the same usernames and passwords as my original cluster, nothing else needs to change.\nIn Sonarr\u0026rsquo;s externalsecret.yaml I change the host:\nmetadata:\n  name: sonarr\nspec:\n  secretStoreRef:\n    kind: ClusterSecretStore\n    name: onepassword-connect\n  target:\n    name: sonarr-secret\n    template:\n      engineVersion: v2\n      data:\n        SONARR__AUTH__APIKEY: \u0026#34;{{ .SONARR_API_KEY }}\u0026#34;\n        SONARR__POSTGRES__HOST: \u0026amp;dbHost postgres17-rw.database.svc.cluster.local\nRun the following command to confirm the change to the secret:\nkubectl describe externalsecret sonarr -n downloads\nBe sure to swap out the name of the secret and the namespace to match your app and namespace.\nYou should see the host has changed\nGo and refresh your Sonarr URL and you should see it changed there too\nIf you want to be doubly sure, delete your Sonarr pod, let k8s recreate it and check again that everything looks good. Once you are happy, migrate your other apps\nAll Apps Migrated Once you have migrated all your apps and can confirm that they are working, you can update your new Cloudnative-PG cluster to remove the initdb and externalClusters settings that pointed to the old external cluster and go back to just having the standard settings you have for Volsync recovery:\nbootstrap:\n  recovery:\n    source: \u0026amp;previousCluster postgres17v1\n# Note: externalClusters is needed when recovering from an existing cnpg cluster\nexternalClusters:\n  - name: *previousCluster\n    barmanObjectStore:\n      \u0026lt;\u0026lt;: *barmanObjectStore\n      serverName: *previousCluster\nScale down / removal of old cluster At this point, you will want to scale down your old cluster. Leave all the files there and let it sit for a week or two so you can be 100% sure everything is running fine before you remove it from your Git repo\nNote Because we are running two clusters under Cloudnative-PG we cannot simply scale the deployment to 0, as this will scale down both clusters. 
We also cannot set the instances to 0, as that is not a valid configuration option for Cloudnative-PG. Instead, we need to comment out the KS (the Flux Kustomization) that sets up the cluster\nIn your root ks.yaml file (kubernetes/apps/database/cloudnative-pg/ks.yaml), comment out the section for the original cluster:\n# ---\n# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/kustomize.toolkit.fluxcd.io/kustomization_v1.json\n# apiVersion: kustomize.toolkit.fluxcd.io/v1\n# kind: Kustomization\n# metadata:\n#   name: \u0026amp;app cloudnative-pg-cluster\n#   namespace: flux-system\n# spec:\n#   targetNamespace: database\n#   commonMetadata:\n#     labels:\n#       app.kubernetes.io/name: *app\n#   dependsOn:\n#     - name: cloudnative-pg\n#     - name: external-secrets-stores\n#   path: ./kubernetes/apps/database/cloudnative-pg/cluster\n#   prune: true\n#   sourceRef:\n#     kind: GitRepository\n#     name: home-kubernetes\n#   wait: true\n#   interval: 30m\n#   retryInterval: 1m\n#   timeout: 5m\n#   postBuild:\n#     substitute:\n#       APP: *app\n#       GATUS_SVC_NAME: postgres-lb\n#       GATUS_SVC_PORT: \u0026#34;5432\u0026#34;\n#       GATUS_NAMESPACE: database\nNote: You may need to remove the finalizers that protect the old Cloudnative-PG PVC in order for the cleanup to complete\nkubectl get pvc -n database\nNAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE\npostgres16-1 Terminating pvc-753ef28f-9f40-44aa-bc33-24f3db0abbfb 20Gi RWO openebs-hostpath \u0026lt;unset\u0026gt; 11d\npostgres17-1 Bound pvc-df4cbbb0-a2f2-4eb5-8377-09c68f6f8942 20Gi RWO openebs-hostpath \u0026lt;unset\u0026gt; 62m\npostgres17-2 Bound pvc-696e5ed8-7a0a-4640-8e3f-55d36d69343c 20Gi RWO openebs-hostpath \u0026lt;unset\u0026gt; 56m\npostgres17-3 Bound pvc-fd1344dd-f596-4ed3-a350-718c69c161f3 20Gi RWO openebs-hostpath \u0026lt;unset\u0026gt; 52m\nkubectl edit pvc postgres16-1 -n database\nRemove the finalizer lines and save\nDone You have successfully migrated your database.\nFinal 
Checks\nLog into your replicationDestination (Backblaze in my case) and check that the first backup has taken place\nCheck all your pods and make sure none are throwing errors\nContinue to monitor your cluster over the coming weeks for any DB-related errors\nOnce you are totally happy, remove all the old files from Git. ","date":"2025-02-01T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/migrating-database-clusters/","title":"Migrating Database Clusters"},{"content":" \u0026ldquo;The difference between something good and something great is attention to detail.\u0026rdquo;\n— Charles R. Swindoll\nProtecting Your Investment Insurance So you just spent all this money on a new car, and if you\u0026rsquo;re smart, you will want to protect that investment. There are a few things you can do.\nFirst of all, you should have insurance before you even drive the car away. There are many insurance companies out there and dealing with them is a royal PITA. In my experience, it is much better to deal with a broker: they act on your behalf and communicate with all the different insurance companies to find the best deal that suits your needs. They do not charge you, as the insurance companies pay them to bring in new clients, and the deals they offer you are always better than if you went direct to the insurance company.\nWe use Hutchison Rodway Insurance. They have been awesome and handle our house, contents and vehicle insurance. We always deal with Jo Mitchel; feel free to call or email her\n⚠️ NOTE: If your vehicle has a glass roof, ENSURE that your insurance company counts it as a windshield so you get one free replacement per year if it gets cracked/damaged\nLong Term Protection Once you have the car, you may choose to protect the paint and body work in a manner designed to keep it looking good for years to come. 
There are several different things you can take a look at:\nVehicle Wrap Purpose: Changes the car\u0026rsquo;s appearance (color, patterns, or designs) and offers light protection. Material: Vinyl film. Protection Level: Minor protection against scratches, chips, and UV rays. Customization: High—can include matte, gloss, satin, carbon fiber, or custom designs. Durability: 3–7 years, depending on quality and care. Cost: Moderate to high, based on coverage and design complexity. Best For: People looking to customize their car’s look or advertise a brand while adding light protection.\nPaint Protection Film (PPF) Purpose: Provides strong protection against scratches, chips, and abrasions. Material: Transparent polyurethane or polymer film. Protection Level: High—absorbs impacts and can self-heal minor scratches. Customization: Clear, but some brands offer matte or colored variants. Durability: 5–10 years with proper care. Cost: High—especially for full-body applications. Best For: People prioritizing protection for high-end, exotic, or new cars to maintain factory paint quality.\nCeramic Coating Purpose: Provides a long-lasting, hydrophobic (water-repellent) layer for shine and easier cleaning. Material: Liquid polymer that chemically bonds to the paint. Protection Level: Protects against UV rays, dirt, and chemical stains, but not scratches or chips. Customization: None—focuses only on enhancing the car\u0026rsquo;s natural paint and gloss. Durability: 2–5 years, depending on maintenance. Cost: Moderate. Higher initial cost but lower maintenance than wax or sealants. Best For: People who want a glossy finish, easy maintenance, and protection against dirt and weather.\nTinted Windows Purpose: Improves privacy, reduces glare, blocks UV rays, and keeps the car cooler. Key Features: Privacy – Makes it harder for others to see inside the car. UV Protection – Blocks up to 99% of harmful UV rays, reducing skin damage and fading of interiors. 
Heat Reduction – Helps keep the interior cooler by reflecting solar heat. Glare Reduction – Reduces glare from sunlight and headlights, improving visibility and comfort. Safety – Helps hold shattered glass together in case of an accident. Style – Enhances the appearance of the vehicle. Types: Dyed Film – Budget-friendly, reduces glare, and adds privacy but offers less heat reduction. Metalized Film – Reflects heat well and strengthens windows but may interfere with GPS and radio signals. Carbon Film – Provides excellent UV protection and heat rejection without signal interference. Ceramic Film – Premium option with the best heat reduction, UV protection, and clarity, and no signal interference.\nCosts (approx NZD)\nFull car ceramic = $800\nFull car PPF = $5,000 - $10,000 (so many variables)\nVehicle wrap = $5,000+\nTints = $400 - $1,000\nOptions I decided to go with a mixture of the following\nPF M8 Satin/Matte PPF - Full external vehicle\nHV2 Ceramic Coating - Full external vehicle over PPF\nPTL50 Ceramic Tints @ 47% (Sides and rear window)\nInterior fabric hydrophobic treatment - Applied to seats, door cards \u0026amp; dash (all ‘leather’ trims)\nChrome delete front \u0026amp; rear (T, TESLA \u0026amp; DualMotor)\nCar Detailing Keep an eye out for our next post which will cover car cleaning\n","date":"2025-01-14T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2025/car-care/","title":"Car Care"},{"content":"Finance woes So, it turns out that in order to take advantage of the 2.99% deal that Tesla was doing with UDC Finance, I had to purchase from inventory.\nThat put a bit of a kibosh on my plans. The transaction also had to be completed and I had to have taken delivery of the vehicle by 31st of December 2024.\nAs it happened, there was no Model 3 Performance in stock, so our choices were limited. 
Decisions had to be made.\nMy wife and I sat down with the salesmen from Tesla and took a look through all the available inventory.\nInventory options (Model 3 Highland):\nStandard Range RWD (Rear Wheel Drive): Pearl White Multi-Coat, Solid Black, Stealth Grey\nLong Range AWD (All Wheel Drive): Pearl White Multi-Coat, Solid Black, Stealth Grey, Deep Blue Metallic\nI really liked the Stealth Grey and my wife did not. We both liked the Deep Blue Metallic\nThere was one problem though. The interior was white. I had never owned a white interior and was worried about keeping it clean. However, after talking to Hagan at The Wrap Shop and doing some research online, we were confident that the Ceramic Protection (along with regular cleaning and maintenance) would keep the seats looking good.\nFinance woes #2 Originally when I spoke to UDC, they said they would be able to add any 3rd party items to the finance deal, e.g. PPF (Paint Protection Film), Tints etc.\nHowever, as we closed in on completing the deal, they went back on that statement. It turns out that Tesla NZ lost money on all inventory sold when offering 2.99%, and their deal with UDC is such that they will not budge and add anything else to the loan that may cost them more money.\nShit\u0026hellip;\nThis really put a dent in my plans and I was going to have to do some quick juggling of things.\nWe put through the car sale, and traded in the Skoda. 
Taking collection on Friday 27th December at 3:30pm\nTesla Model 3 Long Range Dual Motor All-Wheel Drive Deep Blue Metallic Paint 18\u0026quot; Photon Wheels White Premium Interior Autopilot Enhanced Auto Pilot Full Self-Driving Capability So, what are the real-world differences between the originally ordered M3P and the actually delivered M3LR?\nFeature | Model 3 Performance | Model 3 Long Range AWD\nAcceleration (0–100 km/h) | 3.1 seconds | 4.2 seconds\nTop Speed | 261 km/h | 233 km/h\nRange (WLTP) | ~528 km | ~629 km\nPower Output | ~343 kW | ~366 kW\nWheels | 20-inch forged wheels with performance tires | 18-inch (optional 19-inch) wheels\nBrakes | Larger brakes for enhanced stopping power | Standard brakes\nSuspension | Adaptive suspension with performance tuning | Standard suspension\nSeats | Ventilated/heated sport seats with enhanced bolstering | Ventilated/heated sport seats\nInterior Trim | Carbon fiber trim | Standard premium interior\nDriving Modes | Track Mode for customizable dynamics | Standard driving modes\nAudio System | Premium 17-speaker system | Premium 17-speaker system\nSpoiler | Carbon fiber spoiler | None\nSo the key takeaways are that we gained 101 km of WLTP range and lost a little top speed and acceleration.\nAdditionally, what that table does not show is that the tires are much larger on the M3P and much more expensive (~ NZD$ 700 per tire, vs NZD$ 400 for the M3LR)\nNext steps Next steps are getting some upgrades done on the car.\nProfessionally Installed Installing my Genevo Pro II (which was removed from the Skoda) Front PPF Bumper Bonnet Headlights Front Fenders Wing Mirrors Full Car Exterior Ceramic Coating Tints Ceramic Tints Blocks 99% of UV 2x the heat reduction over normal tints Interior fabric hydrophobic treatment Seats Door cards Dash (all ‘leather’ trims) Badge Blackout Front Tesla Logo Rear Tesla Letters Personally Installed Tesla\nMobile Connector Tyre Repair Kit Premium Connectivity Hansshow\n6 Piece Mudflap Set Swivel Mount for Center Console Touch Screen Enhauto\nS3XY Knob (with 
Commander) 2 Add-on S3XY Buttons Gen2 BONUS: At the time of ordering, they were giving away 2 extra buttons so I have 4 Stickers With Icons Samsung\nT7 Shield 1TB Rugged Portable External SSD - Blue For Sentry Mode, etc Aliexpress\nGlove Box Flocking USB HUB Adapter Gives me more USB ports for data and charging Temu\nCenter Console Organizer for Tesla Model 3 Highland Sliding style 128W QC3.0 High Power Fast Charger 3USB 1PD Car Charger Portable 3 Socket The Enhauto S3XY Commander takes up the only 12v accessory plug so this was needed Silicone Center Console Organizer Tray Storage Box Convenient place to store sunglasses 2x Car Cup Holder Tablet Phone Mount with Heavy Duty Cupholder Base For the kids when they want to do things other than what is built into the Tesla\u0026rsquo;s rear screen Tessories NZ\nBundle: Model 3 (2024-2025) Floor Mats + Door Pocket Inserts Bundle: Model 3 (2024-2025) Liner Set Boot Grocery Hook Silicone Cup Holder Liner – Model 3 (2024-2025) Tesla Jack Pads Apps Vehicle Management / Analytics / Diagnostics Tesla - Required to use the car. Acts as your phone key iOS | Android Tessie - The Tesla management platform. Trusted by over 400,000 Tesla drivers. Website | iOS | Android Navigation A Better Route Planner - Optimize your EV journey with efficient routes, charging stop management, and live navigation for stress-free electric vehicle travel. Website | iOS | Android Charging ChargeNet - New Zealand\u0026rsquo;s EV fast-charging network Website | iOS | Android PlugShare - Find EV charging stations with PlugShare, the most complete map of electric vehicle charging stations in the world! Website | iOS | Android Z Energy - Z Public Charging Stations — At a 180kW Ultra-Fast Charger, You Get Roughly 100km of Range Every 8 Minutes of Charging. Website | iOS | Android bp charge - Charging Points Across NZ — Find Your Nearest EV Charger. Website | iOS | Android Zero Charging - Zero is powered by Meridian Energy. 
A New Zealand power company that generates electricity through 100% renewable sources – wind, water and sun. Website | iOS | Android If you install your charger outdoors this could earn you a little extra cash Open Loop - EV Charge Point Owners can offer their EV charger for general public use through the OpenLoop platform and app Website | iOS | Android End of month Come back soon to see updated pictures of the car once all the work is done!\n","date":"2024-12-27T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2024/new-car-delivered/","title":"New Car Delivered"},{"content":"Prequel Back in March my friend/colleague Charles and I (and our respective families) went down to Warbirds over Wanaka in the beautiful Queenstown-Lakes District to see the Airshow.\nWe both owned cars that we enjoyed driving, but with a tendency towards a lead foot, not cheap to run 😅. Since we each needed to rent a car, we took the opportunity to hire an EV. We\u0026rsquo;d heard so much about them but never actually driven one.\nWe both hired a BYD Atto 3 and were looking forward to the experience.\nOne week out from our holiday we both got an email from the rental company, they would be unable to supply us with our cars as both had been badly damaged in accidents\u0026hellip; However, they offered us both FREE upgrades to a Tesla Model 3! Hell yea!\nWe were in Wanaka for 1 week and spent a significant amount of time driving around in the Model 3, making many trips backwards and forwards across the Remarkables between Wanaka and Queenstown.\nNeedless to say, we were both very impressed with the Model 3, despite being a rental (rental cars are heavily mistreated and are typically not in great condition)\nHere are a few of the things that stood out to me:\nLow center of gravity - Having the batteries as a long wide block across the center 2/3 of the car keeps the weight very low. 
This drastically improves handling\nBalanced distribution of weight - Not having a heavy engine in the front meant that the car was evenly balanced front to back (Front: 856kg Rear: 904kg)\nQuiet - I noticed little to no road noise when driving (except where the road surface was really bad). This made it easy to have conversations with people in the car without raising our voices\nStorage - Despite the boot being smaller than the car we currently own (2021 Skoda Superb Sportline Wagon), with the added space in the Frunk (bonnet) for storage, the capacity was similar\nInfotainment - This was a real pleasure to use, the UX (user experience) is fantastic, well thought out, easy to use, AND regularly updated\nPerformance - Even though we were driving the base model, the instantaneous torque was impressive to say the least, especially when pulling out of corners on the open highway.\nFor me, this experience flipped a switch in my head, putting me into \u0026ldquo;research\u0026rdquo; mode and kicking off approximately 9 months of deep research into Teslas and whether they would be a good fit as a replacement car for my family and me.\nGripes with my current car\nFuel is getting increasingly expensive, with 95 Octane costing $2.75 per L and a 70L tank. I\u0026rsquo;m spending nearly $200 per fill and filling twice a month\nThe car is huge, and parking it is a pain\nThe Infotainment SUCKS! It\u0026rsquo;s really slow to start\nApple CarPlay lags like crazy\nThe Radio still doesn\u0026rsquo;t show the names of radio stations, because the importer won\u0026rsquo;t allow it for privacy reasons\u0026hellip;\nIt had the steering wheel replaced multiple times because radar-guided cruise control stopped working\nI was without an engine cover for a year while Skoda redesigned it to prevent fires in the engine bay\nDon\u0026rsquo;t get me wrong, it\u0026rsquo;s a good car. But a lot about it is frustrating. 
It was time for a change.\nTesla concerns Everyone watches videos online and hears all the negative things about Tesla. It\u0026rsquo;s easy to have a strong bias against them without actually having had any experience with one. But holding onto that bias is the place of fools. I wanted to find out the real story, so I did my research.\nFirst, I wanted to make sure that Tesla was the right choice from a pure specifications perspective. I spent a lot of time in Google Sheets, pulling in specification data and comparing all the cars in the market, looking at weight, length, width, boot space, price, power/torque, warranty, range etc. It was an exhaustive list.\nThis affirmed that on price versus spec, nothing beat the Tesla.\nSo the next question was build quality. Now I am a Quality Assurance Engineer by trade, having done Software, Hardware, and Firmware testing. I know that nothing is ever perfect. However, I\u0026rsquo;d heard some nasty stories about build quality and I wanted to do some digging.\nFirst, I want to cut Tesla some slack: they are a \u0026ldquo;new\u0026rdquo; company in the car manufacturing game, and people are unfairly comparing the build quality with companies that are 100+ years old. Those early 5+ years will be shaky for any brand as they refine their processes. That said, I did hear a lot about panels not lining up and that was my greatest concern (I had not seen anything about safety being a concern).\nAfter a lot of digging, this is what I found out.\nThe majority of the issues existed only for customers in the US, taking delivery of cars built in the Fremont/California factory. 
Very few issues with the cars out of Shanghai, China (which is where New Zealand gets its Teslas from)\n99% of these issues were not present on the 2023+ models and I\u0026rsquo;d heard of no issues on the Hardware 4 \u0026ldquo;Highland\u0026rdquo; Model 3s\nThere are exhaustive checksheets you can get for delivery day to make sure everything is good and you don\u0026rsquo;t drive away if there are issues\nYou have 72 hours post delivery to bring up any issues that would cause a delivery rejection; beyond that you fall back to issues being resolved under warranty\nNZ has a STRONG Consumer Guarantees Act that protects the consumer from issues\nFast forward to 2024-Dec-03 My wife and I had been away over the weekend for a surprise trip to Waitomo to see the glow worm caves, and today we had planned to go into Tesla Auckland South to see the Model 3 Performance (Highland) and take it for a test drive\nWe wanted to see how it performed, what it looked like in the flesh, and if we liked the changes.\nLet me tell you, the INSANE acceleration profile lives up to the name. I had never felt my internal organs press against my spine until that moment, it was amazing‼️ 😍\nBack at the dealership, chatting with the sales guy, I learned some useful information.\nTesla had just increased its referral discount to 1,600 for the purchaser and 800 in Tesla credit for the referrer\nTesla also used UDC Finance\nTesla was offering 2.99% while my existing car was at 7.98%\nTesla using UDC made this choice even easier as my Skoda was financed with UDC. 
Here is how it works\nOrdering Process Everything is done inside the Tesla App on your phone You purchase the car through Tesla and pay the (100% refundable if finance or Trade-in fails) deposit of $ 400 Choose UDC for the Finance You tell Tesla that you are trading in a car You let them know that money is owed on the car and that it\u0026rsquo;s with UDC Tesla will give you a trade-in valuation on your car If you accept, Tesla will subtract the amount owing on your existing loan, pay it off to UDC on your behalf, and give you the balance of the trade-in as a reduction in the total loan amount. Example (arbitrary numbers for easy math):\nThe new car is worth $100,000 Your trade-in is valued at $50,000 You still owe $25,000 on the trade-in Total new finance = NewCarPrice - (Trade-in value - OldCarBalance) Total new finance = $100,000 - ($50,000 - $25,000) = $75,000 The next day, I contacted a colleague at work and asked for their Referral code to order a new Tesla.\nThe Car Tesla Model 3 Performance Dual Motor All-Wheel Drive Ultra Red Paint 20\u0026quot; Warp Wheels Black Premium Interior Autopilot Enhanced Auto Pilot Full Self-Driving Capability I decided FSD (Full Self-Driving Capability) was not worth the cost.\nDelivery should be between Jan-Feb 2025 (March if I am unlucky)\nFollow along for more posts as we get closer to the time and I talk more about how I plan to pull data off the car into my Kubernetes cluster for analytics\n","date":"2024-12-13T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2024/ordering-a-new-car/","title":"Ordering a new car"},{"content":"Starting a new blog Kevin Durbin and I have been talking about starting a new blog for a while. This is my attempt to kick things off!\nA little about me.\nHi, I am Gavin :)\nI am a QA (Quality Advocate) I help my team build better quality software. I am a full time Nerd with diverse hobbies from Cars to 3D Printing to Kubernetes. 
This blog will be a place for me to write things that matter to me, and hopefully, you\u0026hellip;\nMore soon!\n","date":"2024-12-13T00:00:00+13:00","permalink":"https://blog.nerdz.cloud/2024/starting-a-new-blog/","title":"Starting a new blog"}]