@bofh
So, I think you could try configuring a higher token timeout in corosync.conf.
Maybe try 10s:
token: 10000
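For context, a rough sketch of where that line would go. The cluster name and config_version below are made-up placeholders, not values from your cluster. As far as I know the token value is in milliseconds, and on PVE you would edit /etc/pve/corosync.conf and raise config_version so the change gets picked up on all nodes:

```
totem {
  # hypothetical cluster name, keep your own
  cluster_name: examplecluster
  # hypothetical value, must be higher than what is there now
  config_version: 5
  version: 2
  # token timeout in milliseconds (10000 = 10s)
  token: 10000
}
```

All other settings in the totem section stay as they are.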
thanks, will try.
Little update on this. Since I don't trust OVH or their ability to provide virtual networks, I set up a second ring over tinc.
Now it's much more stable. Today 2 of the nodes had NO corosync log entries at all.
Weirdly enough, while 2 nodes don't report anything at all for today, the 2 other nodes do.
If there's a network issue, I would have expected every node to report it.
For example, node 3 reports that link0 on node 1 is down, but node 1 says nothing at all.
Now my best guess, without knowing the code, would be the following:
- we might all have different causes for these issues but the same symptoms
- corosync can't handle network errors properly
- corosync doesn't even seem to detect network errors or issues reliably; I guess it still relies heavily on multicast for its error handling and detection, even though we only use unicast now
- PVE doesn't handle such errors well either, resulting in freezes even when quorum still exists
- we should not rely on or trust corosync to report; it might or might not
I also had an issue on another cluster with 2 nodes and no HA.
One of the machines crashed (likely a hardware issue) and was unresponsive to everything (not even ssh), but it still answered ping and apparently still answered corosync.
The result was that PVE on node 1, which was otherwise working properly, froze completely as well.
Oddly enough, even after a reboot of node 1 (I didn't look at node 2 at that time; silly me, checking the freezing node when I should have checked the others),
node 1 froze immediately once corosync came online.
So as long as the faulty node 2 was online, node 1 was unusable. I could not even run qm list or authenticate on the web interface.
Regardless of what network issues we have, or how our nodes crash, a faulty node should never take everyone else down with it.
That might come from corosync, but PVE also messed up here in a big way.
Oddly enough, while I still have the network issues mentioned above and those 2 nodes of the 4-machine cluster still report links down,
I no longer get total freezes or have to restart corosync manually. The second ring (even though it also reports down sometimes) at least solved the issue with PVE, and the cluster now seems to handle those errors more resiliently.
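For anyone who wants to try the same, this is roughly what a second link looks like in corosync.conf. It is only a sketch: the node name and both addresses are made up for illustration, and I am assuming corosync 3 with knet (which is where the link0 naming in the logs comes from). Every node needs its own ring1_addr on the second network:

```
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    # link 0: the existing provider network (made-up address)
    ring0_addr: 192.0.2.11
    # link 1: the tinc VPN (made-up address)
    ring1_addr: 10.99.0.11
  }
  # one node block like this per cluster member
}

totem {
  # existing settings stay as they are
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```

The second link does not need to be perfect. In my case it also goes down sometimes, but the two links rarely seem to fail at the same moment.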