This issue is somewhat related to this thread, but not really. Bear with me.
Setup:
A large (too large; we are currently in the process of splitting it) PVE cluster of 35 nodes running PVE 7.4-16, with the kernel pinned to 5.13.19-15.
3 corosync rings, as in the related thread.
No HA-enabled VMs/containers in the cluster.
Issue:
One PVE node (11) had a memory DIMM error, causing it to reboot.
This node leaves the cluster:
Code:
Feb 7 09:11:19 pve270 corosync[2052]: [QUORUM] Sync left[1]: 11
Feb 7 09:11:19 pve270 corosync[2052]: [TOTEM ] A new membership (1.227f) was formed. Members left: 11
And then a second node reboots:
Code:
Feb 7 09:11:25 pve270 pvescheduler[2283914]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
Feb 7 09:11:30 pve270 pve-ha-lrm[2116]: loop take too long (62 seconds)
...
Feb 7 09:12:28 pve270 corosync[2052]: [QUORUM] Sync left[1]: 26
Feb 7 09:12:28 pve270 corosync[2052]: [TOTEM ] A new membership (1.2283) was formed. Members left: 26
I suspect this node (26) was the quorum master. In the related thread we saw that the quorum master would fence itself even if it had no HA-enabled services. It was also confirmed that when there are no HA-enabled services in the cluster, the watchdogs are not armed and no fencing should occur.
I am just puzzled why node 26 chose to reboot. All nodes are logging "loop take too long". There are no apparent hardware issues on node 26.
My theory is that node 26 was the quorum master and rebooted for some reason, but there is no trace of watchdog or fencing messages in the logs - which I guess makes sense if the assumption about unarmed watchdogs is correct.
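As a way to sanity-check the unarmed-watchdog assumption on a node, something like the following could be run. This is a sketch under two assumptions: that ha-manager status lists no resources when nothing is HA-enabled, and that watchdog-mux keeps a /run/watchdog-mux.active directory only while a client (pve-ha-lrm/pve-ha-crm) actually holds an armed watchdog.

```shell
# Sketch: check HA resources and watchdog-mux client state on a PVE node.
# Assumption: /run/watchdog-mux.active exists only while watchdog-mux
# has active clients, i.e. while the watchdog is armed.
command -v ha-manager >/dev/null && ha-manager status

if [ -d /run/watchdog-mux.active ]; then
    echo "watchdog armed: active clients present"
else
    echo "watchdog not armed"
fi
```

If that prints "watchdog not armed" on all nodes, a watchdog-triggered fence of node 26 would seem even less likely, and the reboot cause would have to be something else entirely.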
Any insights or theories most welcome. I can of course provide more logs if required.
BR
Bjørn