Hello community!
A few days ago, I experienced an issue with the automatic recovery of my HA Proxmox cluster.
We're using a 3-node cluster with manual fencing.
There was a short network failure caused by a broken switch power adapter.
After the switch came back up, the cluster reconnected, but fence_tool kept showing "wait state" messages.
It turned out that everything else worked, but DLM was waiting for fencing to occur. This also blocked all operations that go through rgmanager, e.g. starting or migrating VMs.
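Roughly, these are the commands I used to look at the state (fence_tool is what showed the wait messages; the others are just the standard cman/DLM tooling shipped with PVE, listed here as a sketch of my debugging steps):
Code:
# inspect fence domain membership/state (this is where the "wait state" messages showed up)
fence_tool ls
# list DLM lockspaces; they appeared to be stuck waiting for fencing
dlm_tool ls
# rgmanager's view of cluster members and services
clustat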
After hard resetting each node one after another (a clean reboot was not possible because rgmanager was blocking it), fence_tool looked fine, but VM actions like starting or migrating still failed (error code 1).
It seems the cluster was still waiting for something, so hard resetting all nodes in the cluster simultaneously worked for me, and all operations were available again.
But this can't be a proper solution for my production cluster. I don't want non-HA VMs to be reset just because there was a network failure.
Is there a better way to repair the cluster that takes the dependencies between the services into account? E.g. restarting the services in a particular order so they don't block each other, or using specific CLI commands (see the sketch below).
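For example, something along these lines is what I have in mind (service names as found on a PVE 3.x / Debian node; I don't know whether this order is correct or safe on a live cluster, which is exactly my question):
Code:
# hypothetical restart order, working down and back up the stack;
# unsure whether this is safe while VMs are running
service rgmanager stop      # stop the resource manager first
service pve-cluster stop    # stop pmxcfs (/etc/pve)
service cman stop           # stop cman/corosync membership
service cman start          # bring membership back up
service pve-cluster start   # remount /etc/pve
service rgmanager start     # start HA services again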
Thanks for your feedback!
My cluster.conf:
Code:
<?xml version="1.0"?>
<cluster config_version="14" name="MJ">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"/>
  <fencedevices>
    <fencedevice agent="fence_manual" name="fenceProxmox01"/>
    <fencedevice agent="fence_manual" name="fenceProxmox02"/>
    <fencedevice agent="fence_manual" name="fenceProxmox03"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox-test-cluster1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fenceProxmox01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox-test-cluster2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fenceProxmox02"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox-test-cluster3" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="fenceProxmox03"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="107"/>
  </rm>
</cluster>