Hey guys,
We're facing an issue with one of our Proxmox clusters.
We have an 8-node setup that completely fenced for "no reason" (I assume there's a good reason for it, just not an obvious one).
The cluster has a two-ring corosync setup, and both rings run on independent, fully monitored networks => it *shouldn't* be an issue on that side.
There's nothing special about this setup, except that one of the nodes (Compute004) had been off for a few days due to hardware maintenance, so we were running with 7 nodes up.
Here's my understanding of the events:
Around 14:09:24, all the nodes start logging "[TOTEM ] Retransmit" messages. These retransmits last nearly a minute, and then the whole cluster crashed at 14:10:21. My assumption is that quorum was lost and the cluster spent that minute trying to re-form, after which the watchdog expired on all the nodes, which would explain why the whole cluster rebooted?
I have no corosync logs showing any ring failure. However, corosync.conf uses the default "passive" rrp_mode. Could the network problem have been enough to break quorum and prevent it from re-forming, without ever triggering a failover to the second ring?
If so, would it be safe to set rrp_mode to "active" to make sure both rings are used at the same time? I know the main part of fixing this crash is finding and fixing the network problem that broke quorum, but we'd also like to make sure that if it happens again, the second ring properly takes over.
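For reference, here is a minimal sketch of what I think the totem section would look like with that change (assuming corosync 2.x; the cluster name and bindnetaddr values below are placeholders, not our real ones):

    totem {
      version: 2
      cluster_name: mycluster        # placeholder name
      rrp_mode: active               # currently "passive" (the default)
      interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0        # placeholder: ring 0 network
      }
      interface {
        ringnumber: 1
        bindnetaddr: 10.0.1.0        # placeholder: ring 1 network
      }
    }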
PS: I can include logs if that would be useful. I just don't know which logs, from which of the 7 servers, would be relevant.
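I assume the current ring status on each node would also be relevant; something like this should show whether corosync has flagged either ring as faulty:

    # print the local node ID and the status of each configured ring
    corosync-cfgtool -s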
Thanks!