Hardware Failure

Earlier this week one of the nodes in the Proxmox cluster failed. It was powered on, but it wasn't reachable on the network and gave no video on any of the HDMI ports.

This node is the one I was using for GPU passthrough to a Windows 10 VM for gaming. That quit working a few months ago: the VM stopped booting, so I disabled the GPU passthrough and assumed a kernel update had broken the nvidia kernel module. I'm not doing any gaming on that VM these days; it's only used as a lab machine for testing security software I'm evaluating.

Fast forward to this week. In my troubleshooting, I removed components one at a time to try to resurrect it. No beeps on boot, and the memory looked good. Removing one GPU card, and then both, didn't bring it back either. So I pulled the CPU cooler, and the CPU was cool to the touch after the machine had been powered on for at least a few minutes. My assumption was that the CPU had died.

I bought a new one and replaced it, but there was still no video on boot. Some Googling turned up references to the same problem on this particular motherboard, and someone suggested resetting the BIOS. So I tried it and, wow, I got into the BIOS. After redoing the settings, I got Proxmox to boot back up.

I put the GPU card that had been used for passthrough back in, and there was no video again. Ok, it seems that card was the problem after all. Removing the card wasn't enough to bring the machine back, though; a BIOS reset was also needed. Alright, now I've got a backup CPU in case the original dies for real.

I learned a couple of things, though. My high-availability configuration is not highly available. Given the constraints on network bandwidth between the nodes and Longhorn's sensitivity to any network latency, I'd backed off of shared storage between the nodes for VM disks. That means that despite having a highly available Proxmox cluster, half of the k3s nodes were down and couldn't be migrated to the other node.

That wouldn't be a problem except that I expected k3s to handle having nodes down by redistributing the workload to the remaining nodes. That didn't happen. I'm not sure whether it's because 2 of the 3 master nodes were on the downed host or for some other reason.
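If it was the master nodes, quorum math alone could explain it: etcd (assuming the cluster is running on the embedded etcd rather than an external datastore) only accepts writes while a majority of server nodes is up, and without writes nothing gets rescheduled. A quick sketch of that arithmetic:

```python
# Rough sketch of Raft-style quorum arithmetic, assuming the k3s control
# plane is backed by embedded etcd. The cluster can only accept writes,
# and therefore reschedule workloads, while a majority of server nodes
# is reachable.
def quorum(servers: int) -> int:
    """Smallest majority of server nodes that must stay up."""
    return servers // 2 + 1

for total in (1, 3, 5):
    need = quorum(total)
    print(f"{total} server(s): need {need} up, tolerate {total - need} down")

# With 3 servers the cluster needs 2 up and tolerates only 1 down, so losing
# 2 of 3 masters on one physical host leaves the survivor without quorum
# until the failed node comes back.
```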

Given how busy things are with the business, all of my attention is focused there for the time being, so I'll have to live with things the way they are and consider options to address this at some point. The possible solutions I'm considering are adding a 3rd node (not just a quorum voter node like I have now) and/or putting in a dedicated 10Gb network between the nodes. That should allow for shared storage and true high availability, with VMs able to fail over automatically when a node fails. Neither option is really within reach of my budget for now.

State of the Blog

Almost every week I remember my promise to myself to write blog posts weekly. However, by the time the weekend rolls around I find it difficult to motivate myself to do so, mostly because my weeks are focused on the business and I don't have time for any new self-hosting projects beyond keeping everything that's already deployed updated and running smoothly.

I may take a different tack and write about other subjects. I want to write about the business challenges I'm facing and how I'm solving them, but I also want to preserve my anonymity. That means anything I write would need to be carefully cleansed of any details which could cumulatively be used by a dedicated individual to reveal my identity.

Not that anyone is seeking to track me, but I’m just paranoid like that. Or it’s good OpSec or something.