High Availability is just too unstable, nodes constantly turn off and on

LunarMagic

Member
Mar 14, 2024
I don't know why HA keeps messing up so much. I've tried this twice now. After I enable HA on all of the virtual machines, nodes start rebooting randomly and nonstop. I've tried changing settings in HA and just can't get this to stop. When I turn off HA, all of my nodes are perfectly fine.
[screenshot attached]
 
The classic suspicion is that your corosync network is weak, has high latency, or is not separate from a (congested) "main" network. Remember that VLANs do NOT have separate wires per LAN ;-)
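A quick way to get a feel for the latency on the corosync ring is to ping the ring address of another node with a short interval (the address below is just an example; use a peer's ring address from /etc/pve/corosync.conf):
Code:
~# ping -c 100 -i 0.2 -q 10.3.16.9
Corosync is sensitive to latency spikes; if the round-trip times are not in the low single-digit milliseconds, or the jitter is large, that is already a warning sign.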

Then the obvious recommendation would be to establish one (or two) physically independent networks used exclusively for corosync...
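Just as a sketch of what that can look like on an existing cluster: you can add a second corosync link by editing /etc/pve/corosync.conf, giving every node a ring1_addr on the dedicated network and adding a matching interface block (the names and addresses below are made up, and config_version in the totem section must be increased when you edit the file):
Code:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    # existing link on the main network
    ring0_addr: 10.3.16.9
    # new link on the dedicated corosync network
    ring1_addr: 10.11.16.9
  }
  # ...one entry like this per node...
}
totem {
  # ...existing settings stay as they are...
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}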


Disclaimer: pure guessing...
 
Each one of the machines is on its own switch with its own connections.
 
Okay, great!

Of course there may be several causes of a cluster reboot - think of a glitch on the power line.

If the reason was HA --> fencing, you would find entries about it in the journal, similar to this:
Code:
~# journalctl   --grep fenc
-- Boot 2052e14b8b124d3e8cca747c9c998d64 --
-- Boot 8a67ec26cb82418592d5b22a0cd97c3d --
-- Boot 46c01a76137346028f356e323f5f8bcd --
Dec 18 19:03:27 pvem pve-ha-crm[2232]: node 'pvei': state changed from 'unknown' => 'fence'
Dec 18 19:04:28 pvem pve-ha-crm[2232]: fencing: acknowledged - got agent lock for node 'pvei'
Dec 18 19:04:28 pvem pve-ha-crm[2232]: node 'pvei': state changed from 'fence' => 'unknown'
Look for it on each node. If you find matches, examine the journal entries from a few minutes before that timestamp.
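Another angle is the HA stack's own view: ha-manager status shows the master/LRM state per node and each HA service. The node and VM names below are placeholders and the real output includes timestamps, but it looks roughly like this:
Code:
~# ha-manager status
quorum OK
master pvem (active, <timestamp>)
lrm pvei (active, <timestamp>)
service vm:100 (pvei, started)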

To check live corosync connectivity I use
Code:
~# corosync-cfgtool -n
Local node ID 10, transport knet
nodeid: 2 reachable
   LINK: 0 udp (10.3.16.13->10.3.16.9) enabled connected mtu: 1397
   LINK: 1 udp (10.11.16.13->10.11.16.9) enabled connected mtu: 1397
...
Verify that all of the expected nodes are listed and have consistent settings.
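Two more commands worth running on every node to cross-check membership and link health (output omitted here, but both exist on a standard PVE install):
Code:
~# pvecm status            # quorum and membership as Proxmox sees it
~# corosync-cfgtool -s     # per-link status and statistics for the local node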
 