Help me do better next time: big outage due to LVM issues and ha-manager freeze

blablub44 · Oct 8, 2024
Hey all,

I just got through an outage of my 4-node cluster and I'm trying to learn where it all went wrong. Feel free to roast me regarding bad decisions, I want to learn.

Quick Overview:
  • 4 nodes; storage is a mixture of Ceph (being phased out, since it never proved feasible in a 4-node cluster) and shared LVM on an FC SAN.
  • A handful of workloads are enrolled in ha-manager
  • I'm not too familiar with Proxmox, this is my first real cluster
I'm still working on my SAN config, and it all went to shit when I removed one LUN on the SAN side without first removing it on the Proxmox nodes. This LUN had no workloads on it whatsoever; it had been removed from shared storage a few days earlier. It was, however, still in my multipath config and was recognized by LVM on the nodes.
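From what I've pieced together since, the removal order should have been roughly the following. The storage ID, VG, and device names here are made up; your paths will differ:

```shell
# 1. Remove (or at least disable) the storage entry in Proxmox first
pvesm remove san-lun1-storage        # hypothetical storage ID

# 2. Deactivate and remove the LVM layer while the paths are still alive
vgchange -an vg_lun1
vgremove vg_lun1
pvremove /dev/mapper/lun1

# 3. Flush the multipath map, then delete the underlying SCSI path devices
multipath -f lun1
for dev in sdX sdY; do               # the path devices backing the map
    echo 1 > /sys/block/$dev/device/delete
done

# 4. Only now unmap the LUN on the SAN side
```

Done in this order, LVM and multipath never see a device disappear underneath them.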
As soon as I removed the LUN, the entire LVM management plane freezes: pvs, lvs, lvdisplay, you name it, they all hang indefinitely. The Proxmox web UI seems to rely on them as well, because ALL VMs suddenly showed status "unknown", even those that run on Ceph and don't touch LVM at all. (This might already have been a mistake on my side, but may I remark that this feels like bad design: one broken LUN can kill the entire VM control plane?)
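One thing I found afterwards: you can apparently get pvs/lvs responsive again without rebooting by rejecting the dead device in /etc/lvm/lvm.conf (the device path below is hypothetical):

```
devices {
    # reject the vanished multipath device, accept everything else
    global_filter = [ "r|^/dev/mapper/lun1$|", "a|.*|" ]
}
```

Already-hung processes may still need to be killed, but new invocations should stop probing the dead path.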

So, in order not to interrupt the workloads, I try online-migrating via the CLI. The goal is to reboot node by node, since I know a reboot will fix LVM. Some migrations work, others don't; some don't even start. Most migrated machines are stuck in lock:migrate despite the migration being done and the workload running happily on the destination (deactivating the LV on the source node fails, I guess?).
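For anyone finding this later: once the migration has really finished and the VM is running on the destination, the stale lock can apparently be cleared by hand (100 is a placeholder VMID):

```shell
qm unlock 100                    # clear the stale lock:migrate
qm config 100 | grep -i lock     # verify no lock line remains
```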

Rebooting also hangs indefinitely (systemd trying to deactivate the LVs too, I think), so after 10+ minutes I "reboot -f". As far as I am aware, corosync is healthy during all of this.
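In hindsight, once a node's workloads were off it, I could have skipped the hanging shutdown path entirely with a sysrq reboot instead of waiting on systemd:

```shell
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
echo b > /proc/sysrq-trigger       # immediate reboot, no LV deactivation
```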
At some point during this, or maybe due to my forced reboots, ha-manager goes wild: the manager is stale and re-election somehow does not progress. HA VMs are now in the "freeze" state. I try to bypass ha-manager to get my workloads back up by
  1. setting them to disabled in ha-manager (not accepted)
  2. deleting them from ha-manager (not accepted, stuck in "deleting")
  3. finally stopping pve-ha-lrm then pve-ha-crm on all cluster nodes (no success - qm start still does not start the vm)
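The commands behind those three steps were roughly the following (vm:100 is a placeholder resource ID). I've since read that the LRM keeps the watchdog armed until it goes idle, so the resources need to actually reach the disabled/removed state before stopping the services is safe:

```shell
# step 1: request the disabled state for each HA resource
ha-manager set vm:100 --state disabled
# step 2: remove the resource from HA management entirely
ha-manager remove vm:100
# step 3: on every node, stop the LRM first, then the CRM
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm
```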
Now for the crescendo: two nodes are stuck on the "systemctl stop pve-ha-lrm" command. After waiting maybe 10 minutes, I pkill the processes on one machine and BOOM, two (!) of my nodes immediately reboot. The last message in the kernel log is "kernel: watchdog: watchdog0: watchdog did not stop!". Thankfully, Ceph and all VMs on the SAN survived this without corruption.

Some time later, after slowly starting pve-ha-crm and pve-ha-lrm again on a few machines, ha-manager works again and I remove all my workloads from it. The situation stabilizes, and I can now control my VMs again.

Thanks for face-palming through this with me, feel free to let me know what I did wrong at each step.
 