Hello there,
we are having trouble with our Proxmox cluster and need your help.
We are running a PVE cluster with 29 nodes and approximately 1500 VMs with HA configured. After a failure we do not yet understand, the complete cluster rebooted and tried to start every resource once quorum was reached. As the nodes took quite different amounts of time to reboot, quorum returned before all nodes were back, and some resources were relocated to other nodes.
These nodes then in turn restarted again due to high memory usage, and the cycle started all over.
We now have all nodes of the cluster up and running and all VMs stopped; Ceph is clean.
The problem now is that the HA state is kind of broken - see attached picture "HA-state-1.png".
What we have tried so far: we are afraid that if we fix the HA master node, all resources will be started again, so we tried to delete all HA resources via the API. This worked for some resources but not for all.
At the moment it is not possible to stop, start, or migrate any resource with HA configured. The task "HA 123 - stop" is visible, but the VM never gets stopped...
The cluster is currently running pve-manager/7.4-3/9002ab8a. If there is any information I could provide, just note it here and I will post it.
Can someone provide a solution to completely remove all HA resources and get the cluster into a clean state?
Thank you in advance for any help