Hey all,
I just got through an outage of my 4-node cluster and I'm trying to work out where it all went wrong. Feel free to roast my bad decisions; I want to learn.
Quick Overview:
- 4 nodes; storage is a mix of Ceph (being phased out, as it's impractical in a 4-node cluster) and shared LVM on an FC SAN.
- A handful of workloads are enrolled in ha-manager
- I'm not too familiar with Proxmox, this is my first real cluster
As soon as I removed the LUN, the entire LVM management plane froze: pvs, lvs, lvdisplay, you name it, they all hung indefinitely. The Proxmox web UI seems to rely on those commands too, because ALL VMs suddenly showed status "unknown" - even the ones running on Ceph that don't touch LVM at all. (This may already have been a mistake on my side, but may I remark that this feels like bad design - one broken LUN can kill the entire VM control plane?)
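In case it helps anyone hitting the same wall: I believe the tools hang because they try to open the dead PV. A workaround I found only afterwards (untested during the outage; `mpathX` is a placeholder for whatever multipath name the dead LUN had) is to tell LVM to skip that device for a single invocation:

```shell
# Exclude the dead device for this one command so pvs/lvs respond again.
# mpathX is a PLACEHOLDER - substitute the actual multipath map name.
pvs --config 'devices { global_filter = [ "r|/dev/mapper/mpathX|" ] }'

# Once nothing on the LUN is open anymore, flush the stale multipath map:
multipath -f mpathX
```

No idea if this would have saved the day here, but it at least gets the LVM tooling talking again without a reboot.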
So, in order not to interrupt the workloads, I try online-migrating via the CLI. The plan is to reboot node by node, since I know that will fix LVM. Some migrations work, others don't, and some don't even start. Most migrated machines end up stuck in lock:migrate even though the migration has finished and the workload is running happily on the destination (deactivating the LV on the source node fails, I guess?).
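For anyone else stuck in lock:migrate after the migration has actually completed, the manual escape hatch I learned about later is qm unlock (100 is a placeholder VMID - and obviously only clear the lock once you've confirmed the VM really is running on exactly one node):

```shell
# On the node that now owns the VM. 100 is a PLACEHOLDER VMID.
qm unlock 100
qm status 100
```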
Rebooting also hangs indefinitely (systemd trying to deactivate LVs too, I think), so after 10+ minutes I resort to "reboot -f". As far as I can tell, corosync stays healthy through all of this.
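Side note I picked up afterwards: the systemd equivalent of what I did is the double-force reboot, which skips the shutdown jobs (including the LV deactivation that was hanging) entirely:

```shell
# One --force skips service shutdown; two --force reboots immediately,
# without terminating processes or unmounting filesystems (like reboot -f).
systemctl reboot --force --force
```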
At some point during this, or maybe because of my forced reboots, the HA manager goes wild: the manager is stale and re-election somehow does not progress. HA VMs are now in the "freeze" state. I try to bypass the HA manager to get my workloads back up by:
- setting them to disabled in ha-manager (not accepted)
- deleting them from ha-manager (not accepted, stuck in "deleting")
- finally stopping pve-ha-lrm, then pve-ha-crm, on all cluster nodes (no success - qm start still does not start the VM)
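In hindsight, I suspect the piece missing from that last attempt was clearing the per-VM lock before trying to start. A rough sketch of what I'd try next time (100 is a placeholder VMID; whether this would have worked in my situation is an assumption):

```shell
# On every node, stop the HA services so nothing fights the manual actions:
systemctl stop pve-ha-lrm pve-ha-crm

# Then, per frozen VM (100 is a PLACEHOLDER VMID):
qm unlock 100
qm start 100
```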
Some time later, after slowly starting pve-ha-crm and pve-ha-lrm on a few machines, ha-manager works again and I remove all my workloads from it. The situation stabilizes, and I can control my VMs again.
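For reference, the removal step once the CRM was responsive again was just (vm:100 is a placeholder service ID):

```shell
# Per HA resource. vm:100 is a PLACEHOLDER service ID.
ha-manager remove vm:100

# Sanity-check that the manager is active and the resource list is empty:
ha-manager status
```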
Thanks for face-palming through this with me, feel free to let me know what I did wrong at each step.