High Availability is just too unstable, nodes constantly turn off and on

LunarMagic

Member
Mar 14, 2024
I don't know why HA keeps messing up so much. I've tried this twice now. After I enable HA on all of the virtual machines, nodes start rebooting randomly and nonstop. I've tried changing settings in HA and just can't get this to stop. When I turn off HA, all of my nodes are perfectly fine.
[screenshot attached]
 
The classic suspicion is that your corosync network is weak, has high latency, or is not separate from a (congested) "main" network. Remember that VLANs do NOT have separate wires per LAN ;-)
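A quick way to get a feel for the latency on the corosync ring is to ping the ring address of another node with a short interval (the address below is just an example; use a peer's ring address from /etc/pve/corosync.conf):
Code:
~# ping -c 100 -i 0.2 -q 10.3.16.9
Corosync is sensitive to latency spikes; if the round-trip times are not in the low single-digit milliseconds, or the jitter is large, that is already a warning sign.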

Then the obvious recommendation would be to establish one (or two) physically independent networks used exclusively for corosync...
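Just as a sketch of what that can look like on an existing cluster: you can add a second corosync link by editing /etc/pve/corosync.conf, giving every node a ring1_addr on the dedicated network and adding a matching interface block (the names and addresses below are made up, and config_version in the totem section must be increased when you edit the file):
Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    # existing link on the main network
    ring0_addr: 10.3.16.9
    # new link on the dedicated corosync network
    ring1_addr: 10.11.16.9
  }
  # ...one entry like this per node...
}
totem {
  # ...existing settings stay as they are...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}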


Disclaimer: pure guessing...
 
Each one of the machines is on its own switch with its own connections.
 
Okay, great!

Of course there may be several causes of a cluster reboot - think of a glitch on the power line.

If the reason was HA --> fencing, you would find entries about it in the journal, similar to this:
Code:
~# journalctl   --grep fenc
-- Boot 2052e14b8b124d3e8cca747c9c998d64 --
-- Boot 8a67ec26cb82418592d5b22a0cd97c3d --
-- Boot 46c01a76137346028f356e323f5f8bcd --
Dec 18 19:03:27 pvem pve-ha-crm[2232]: node 'pvei': state changed from 'unknown' => 'fence'
Dec 18 19:04:28 pvem pve-ha-crm[2232]: fencing: acknowledged - got agent lock for node 'pvei'
Dec 18 19:04:28 pvem pve-ha-crm[2232]: node 'pvei': state changed from 'fence' => 'unknown'
Look for it on each node. If you find matches, examine the journal entries from a few minutes before that timestamp.
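Another angle is the HA stack's own view: ha-manager status shows the master/LRM state per node and each HA service. The node and VM names below are placeholders and the real output includes timestamps, but it looks roughly like this:
Code:
~# ha-manager status
quorum OK
master pvem (active, <timestamp>)
lrm pvei (active, <timestamp>)
service vm:100 (pvei, started)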

To check live corosync connectivity I use
Code:
~# corosync-cfgtool -n
Local node ID 10, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.3.16.13->10.3.16.9) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.13->10.11.16.9) enabled connected mtu: 1397
...
Verify that all of the expected nodes are listed and have consistent settings.
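Two more commands worth running on every node to cross-check membership and link health (output omitted here, but both exist on a standard PVE install):
Code:
~# pvecm status            # quorum and membership as Proxmox sees it
~# corosync-cfgtool -s     # per-link status and statistics for the local node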
 