Proxmox VE 6.3 restarts when GW not available

juniper

Hi,

I don't know if it's an issue or my fault:

We have a 5-node Proxmox VE 6.3 cluster, and when we do some maintenance on our firewall/GW (for example, when the gateway is unavailable for some time), the Proxmox VE nodes reboot (all nodes are on the same subnet, obviously).

Is this normal behavior? Do we need to add a ring link to our cluster?

With the old version (multicast), everything worked fine when the GW wasn't available.

Thanks in advance.
 
How is your network and PVE cluster set up? Is it possible that the network for the PVE cluster (/etc/pve/corosync.conf) is down when you reboot the gateway?
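
For reference, the nodelist section in /etc/pve/corosync.conf shows which addresses the cluster traffic uses; a rough excerpt (names and IPs here are just placeholders) looks like this:

Code:
# /etc/pve/corosync.conf (excerpt, placeholder names/addresses)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.10.11   # corosync link 0; a second link would add ring1_addr
  }
  ...
}

If ring0_addr sits on the same network as the gateway, a GW outage can take the cluster communication down with it.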

Do you have HA enabled? If you do have HA enabled and a reboot of the GW means that the corosync network is unavailable for some time, this is expected behavior.
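
You can check whether any resources are actually under HA with the following (the exact output shape varies a bit by version, but it lists the configured HA resources):

Code:
ha-manager status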

If the corosync communication from one node to the major part of the cluster is broken for about 2 minutes (IIRC), the node will fence itself (hard reset) to make sure that the HA guests on it are definitely off, so that the remaining nodes can safely start the HA guests without corrupting the VMs' disks. If the corosync network is down for all nodes, every node with HA guests will behave like that -> many nodes reboot.
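
After such a fence, the journal from the previous boot usually shows the quorum loss and the watchdog firing shortly before the reset; something like this should surface it:

Code:
journalctl -b -1 -u corosync -u pve-ha-lrm -u watchdog-mux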

If you don't use HA, or if the corosync network keeps working during a GW reboot, we will have to investigate further to find what is causing this.
 

No resources are added to HA, but we have two groups configured (from an old test).
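
Since the groups are leftovers from a test, one option might be to remove them to rule the HA stack out entirely (the group name below is a placeholder; ha-manager groupconfig lists the actual names):

Code:
ha-manager groupconfig            # list the configured groups
ha-manager groupremove <group>    # remove a leftover test group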

The GW and the cluster are on different switches, and I think the cluster network works fine during a GW reboot (but I have to test...).

Testing is difficult...
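
One low-risk way to check, assuming console/IPMI access that doesn't depend on the GW, might be to watch quorum and link state on a node while the gateway reboots:

Code:
watch -n1 'pvecm status; corosync-cfgtool -s'

If quorum and the links stay up the whole time, the cluster network is fine and the reboots need another explanation.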