Hi all
Just some info about our environment:
* Proxmox 7.4-3 Enterprise
* 4 nodes in 1 cluster:
  * 1 dedicated to GPU workloads and used as "extra resource when patching etc."
  * 3 others for all other workloads
* Each of the three shows 256 x AMD EPYC 7742 64-Core Processor (2 sockets), i.e. 256 logical CPUs per node
* Each of the three has 2 TB of RAM
* All hosts are connected to a Ceph cluster for shared storage
* We have around 110 VMs in the cluster, and in general it's heavily overprovisioned: the cluster summary shows around 1-2% CPU usage and less than 10% memory usage.
* This is a VM only setup (no containers)
Some weeks ago we had a hardware crash on one of our non-GPU nodes in the cluster. We realized that the VMs were not being restarted on the other hosts, as we had expected from a cluster.
We investigated, and ended up creating an HA group called "prod", adding the 3 non-GPU nodes to it, and enabling HA on 3 VMs, which we added to the newly created group.
This setup had been running for some weeks, and now we decided to enable HA on all VMs, as we would like to protect all of them.
So I did something like this from one of the nodes: for i in $(seq 100 210); do ha-manager add vm:$i; done
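In hindsight, a more cautious version of that loop might look like the sketch below. It adds the VMs in small batches, passes the HA group explicitly, and pauses between batches so the cluster can be checked (for example with "ha-manager status") before continuing. The function name add_ha_batched and the variables HA_CMD, BATCH_SIZE and PAUSE are made up for this example, and the batch size and pause are guesses, not recommended values.

```shell
#!/usr/bin/env bash
# Sketch only: HA_CMD, BATCH_SIZE and PAUSE are hypothetical knobs for this example.
HA_CMD="${HA_CMD:-ha-manager}"   # set HA_CMD=echo to print the commands instead of running them
BATCH_SIZE="${BATCH_SIZE:-5}"    # how many VMs to add before pausing
PAUSE="${PAUSE:-60}"             # seconds to wait between batches

add_ha_batched() {
    local first="$1" last="$2" group="$3" count=0 id
    for id in $(seq "$first" "$last"); do
        # Unused VMIDs in the range are expected to make the command fail;
        # report them and carry on with the next ID.
        if ! "$HA_CMD" add "vm:$id" --group "$group"; then
            echo "could not add vm:$id, skipping" >&2
        fi
        count=$((count + 1))
        # Pause after each full batch so the cluster state can be inspected.
        if [ $((count % BATCH_SIZE)) -eq 0 ]; then
            sleep "$PAUSE"
        fi
    done
}

# Example: add_ha_batched 100 210 prod
```

Running it first with HA_CMD=echo prints the commands that would be executed, without touching the cluster.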
Approximately 10 minutes after this (we had confirmed with "pvesh get cluster/resources --type vm" that HA was being enabled on all VMs), I noticed CPU usage went to 100%, and then the nodes began to reboot, one after the other. They did not shut down the VMs first, they just rebooted. iLO showed a "reset".
We ended up in a total disaster with all nodes down and, when the nodes finally came up, a lot of disk locks on the Ceph storage system that prevented the VMs from booting.
I assume we were hit by fencing, maybe because of heartbeat timeouts due to high CPU.
I'm pretty amazed that trying to enable HA caused my entire cluster to go down!
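To check that assumption, the journals of the cluster and HA services around the time of the resets could be filtered for watchdog, fencing and quorum messages. The sketch below is one way to do it; the helper name filter_ha_events and its keyword list are my own guesses, not an official list of log messages, and it only helps if the journal survived the reset (persistent journald storage).

```shell
#!/usr/bin/env bash
# Keep only lines that mention watchdog, fencing, quorum or lost locks.
# The keyword list is a guess, not an official list of log messages.
filter_ha_events() {
    grep -Ei 'watchdog|fenc|quorum|lost lock'
}

# Example usage on a node (adjust the time window to the incident):
# journalctl --since "2 hours ago" -u corosync -u pve-ha-crm -u pve-ha-lrm -u watchdog-mux | filter_ha_events
```

If the watchdog or quorum messages line up with the reset times seen in iLO, that would support the fencing theory.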
So, my questions:
* Did I do something wrong?
* Should I disable fencing before enabling HA?
* Can I add a pool to an HA group, to protect all VMs?
Hopefully someone can pinpoint what went wrong, and how I can enable HA safely.
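On the pool question: as far as I understand, an HA group only lists nodes, and resources are added one by one, so the closest thing I can think of is scripting over the pool members. A rough sketch, assuming a pool named "prodpool" (hypothetical) and that every member with a vmid is a VM, which holds here because we have no containers; the helper name pool_vmids is made up:

```shell
#!/usr/bin/env bash
# Extract the numeric VMIDs from the JSON that pvesh prints for a pool.
# Assumes every pool member carrying a "vmid" field is a VM (no containers).
pool_vmids() {
    grep -oE '"vmid" *: *"?[0-9]+' | grep -oE '[0-9]+$' | sort -n
}

# Example usage ("prodpool" and the group "prod" are placeholders):
# for id in $(pvesh get /pools/prodpool --output-format json | pool_vmids); do
#     ha-manager add "vm:$id" --group prod
# done
```

This still adds every VM individually, so it does not avoid whatever load the bulk enable caused.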
Best regards, Kasper