6 node HA cluster split brain

ernestneo

The entire cluster spontaneously reboots when the link between the two sites is disrupted.

Bash:
Cluster information
-------------------
Name:             HA1
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Aug  1 10:47:46 2024
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.11f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      5
Quorum:           4 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.1.1 (local)
0x00000003          1 172.16.1.3
0x00000004          1 172.16.1.4
0x00000005          1 172.16.1.5
0x00000006          1 172.16.1.6

[3 nodes] < --x-- > [ 2 nodes , 1 node down]


Will pvecm expected 2 help in this scenario, while keeping both halves of the split available?
 
The entire cluster spontaneously reboots when the link between the two sites is disrupted.

That's expected behavior. Each site has only 3 of the 6 expected votes, which is not "more than 50%": with 6 expected votes the quorum is 4, so neither side of the split is quorate.

Note that it triggers a reboot of a node only if High Availability is in use on that specific node.

Will pvecm expected 2 help in this scenario, while keeping both halves of the split available?

This is a setting that should really only be used for disaster recovery - which may be the mode you were in this morning.

Lying to the system that "expected 2" is fine lets both halves of the cluster do whatever they want, with a good chance of creating bad situations like starting the same VM on both sides.

So, yes: "expected 2" (or 3) will help you get going again. But be really careful...!
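
For reference, a minimal sketch of how that looks on the CLI, assuming you run it on exactly one side of the split - the side you want to keep serving (the vote count is the only thing you'd adjust):

Bash:
# DANGER: disaster recovery only - run this on ONE side of the split only,
# otherwise both halves become quorate and you risk starting the same VM twice.
# Lower the expected votes so the surviving 3-node partition is quorate again:
pvecm expected 3

# Check that "Quorate: Yes" comes back:
pvecm status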

As @spirit already said: only a third site acting as a neutral witness, with a connection path independent of the direct link between the two main sites, will help avoid this drama.
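
For completeness, a rough sketch of adding such a witness as a QDevice; the 172.16.1.7 address is just a placeholder for the third-site host, see the pvecm documentation for the exact procedure:

Bash:
# On the third-site witness (a small Debian box or VM is enough):
apt install corosync-qnetd

# On every cluster node:
apt install corosync-qdevice

# On one cluster node - registers the witness, adding a tie-breaking vote:
pvecm qdevice setup 172.16.1.7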
 
Note that it triggers a reboot of a node only if High Availability is in use on that specific node.

Thank you, that's clearer on the behaviour now.
So keeping the nodes clustered but putting no VMs under HA will prevent the cluster from killing itself should the link break happen?
 
So keeping the nodes clustered but putting no VMs under HA will prevent the cluster from killing itself should the link break happen?
Yes. Look at "Datacenter > HA > Status", Status column. If the nodes are no longer "active", they shouldn't reboot automatically anymore.
This happens 10 minutes after the last HA resource for a VM is deleted or set to "ignored".

As the others said: if you want to use HA, make sure there is one side left with more than 50% of the votes.
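
The same can be checked and changed from the CLI; a short sketch, with vm:100 as a placeholder resource ID:

Bash:
# Show all HA resources and the LRM state ("active"/"idle") of each node
ha-manager status

# Either remove a VM from HA management completely ...
ha-manager remove vm:100

# ... or keep the resource entry but have HA ignore it
ha-manager set vm:100 --state ignored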
 
Yes. Look at "Datacenter > HA > Status", Status column. If the nodes are no longer "active", they shouldn't reboot automatically anymore.
This happens 10 minutes after the last HA resource for a VM is deleted or set to "ignored".

As the others said: if you want to use HA, make sure there is one side left with more than 50% of the votes.

Yup got it now, thanks again.
 
[3 nodes @ Site A ] [switch] ---- 100G ---- [switch] [ 3 nodes @ Site B]

So I suppose this is an OM3/OM4 link.


Yup got it now, thanks again.

Just be aware that, due to a long-standing bug [1], the "master" will still reboot one last time upon losing quorum, even after you have disabled all HA. You can reboot it manually beforehand to avoid that, if that makes more sense to you.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
 
So I suppose this is an OM3/OM4 link.
The link is OS2, a campus cross-connect, 2 redundant paths, LACP.

However, one of the fibers had an issue (not a hard down), causing huge packet loss, and the drama happened.
Just be aware that, due to a long-standing bug [1], the "master" will still reboot one last time upon losing quorum, even after you have disabled all HA. You can reboot it manually beforehand to avoid that, if that makes more sense to you.
Ah yes, I read about that bug - thanks for filing it and for the heads-up.
 
The link is OS2, a campus cross-connect, 2 redundant paths, LACP.

However, one of the fibers had an issue (not a hard down), causing huge packet loss, and the drama happened.

Do you use at least two "rings" for corosync, or just LACP with a single one? Is that one switch, or more with MLAG? If the latter, I would wonder what was in the logs during the outage.

What's the reason you clustered it 3+3? Do you actually need HA across the two sites? Thinking of possible failure scenarios (e.g. a switch dying), it's actually not more available; in fact, it is less available than with HA off, because with HA off you only lose half of the resources (those of one site) if e.g. the link goes down, whereas with HA on you lose everything on both sites.

If you only need to migrate across the two sites (i.e. that is your only reason to have it as one 6-node cluster), you could make use of qm remote-migrate, PBS, or manual ZFS send/receive, for instance. Alternatively, you could add at least a QDevice to remain quorate on at least one site.
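
For reference, a rough sketch of a qm remote-migrate call (still marked experimental; the VMIDs, host, API token, fingerprint, bridge and storage below are placeholders):

Bash:
# Migrate VM 100 to VMID 100 on a node of the other cluster.
# Endpoint, token, fingerprint, bridge and storage are placeholders.
qm remote-migrate 100 100 \
  'host=192.0.2.10,apitoken=PVEAPIToken=root@pam!migrate=<secret-uuid>,fingerprint=<target-cert-fingerprint>' \
  --target-bridge vmbr0 --target-storage local-zfs --online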
 
Do you use at least two "rings" for corosync, or just LACP with a single one? Is that one switch, or more with MLAG? If the latter, I would wonder what was in the logs during the outage.

What's the reason you clustered it 3+3? Do you actually need HA across the two sites? Thinking of possible failure scenarios (e.g. a switch dying), it's actually not more available; in fact, it is less available than with HA off, because with HA off you only lose half of the resources (those of one site) if e.g. the link goes down, whereas with HA on you lose everything on both sites.

If you only need to migrate across the two sites (i.e. that is your only reason to have it as one 6-node cluster), you could make use of qm remote-migrate, PBS, or manual ZFS send/receive, for instance. Alternatively, you could add at least a QDevice to remain quorate on at least one site.


Yes, it is a 2-fiber path, with MLAG to 2 switches.


Bash:
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] rx: host: 6 link: 0 is up
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 1 is down
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 has no active links
Jul 31 15:56:41 pve5 corosync[2752]:   [TOTEM ] Token has not been received in 4200 ms
Jul 31 15:56:48 pve5 watchdog-mux[2218]: client watchdog expired - disable watchdog updates
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] Sync members[3]: 1 4 5
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] Sync left[2]: 3 6
Jul 31 15:56:49 pve5 corosync[2752]:   [TOTEM ] A new membership (1.105) was formed. Members left: 3 6
Jul 31 15:56:49 pve5 corosync[2752]:   [TOTEM ] Failed to receive the leave message. failed: 3 6
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: members: 1/3756085, 4/4010098, 5/1261462
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: starting data syncronisation
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: members: 1/3756085, 4/4010098, 5/1261462
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: starting data syncronisation
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] Members[3]: 1 4 5
Jul 31 15:56:49 pve5 corosync[2752]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: node lost quorum
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: received sync request (epoch 1/3756085/0000000C)
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: received sync request (epoch 1/3756085/0000000C)
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: received all states
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: leader is 1/3756085
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: synced members: 1/3756085, 4/4010098, 5/1261462
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: all data is up to date
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: dfsm_deliver_queue: queue length 18
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] crit: received write while not quorate - trigger resync
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] crit: leaving CPG group
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: received all states
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: all data is up to date
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: dfsm_deliver_queue: queue length 219
Jul 31 15:56:50 pve5 pmxcfs[1261462]: [dcdb] notice: start cluster connection
Jul 31 15:56:50 pve5 pmxcfs[1261462]: [dcdb] crit: cpg_join failed: 14
Jul 31 15:56:50 pve5 pmxcfs[1261462]: [dcdb] crit: can't initialize service
Jul 31 15:56:50 pve5 pve-ha-crm[1262518]: loop take too long (64 seconds)
Jul 31 15:56:50 pve5 pvescheduler[3484246]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 31 15:56:50 pve5 pve-ha-crm[1262518]: status change slave => wait_for_quorum
Jul 31 15:56:50 pve5 pve-ha-lrm[1262029]: lost lock 'ha_agent_pve5_lock - cfs lock update failed - Device or resource busy
Jul 31 15:56:50 pve5 pve-ha-lrm[1262029]: status change active => lost_agent_lock
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] link: host: 1 link: 0 is down
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] link: host: 1 link: 1 is down
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 has no active links
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 has no active links
Jul 31 15:56:55 pve5 pve-ha-lrm[1262029]: loop take too long (58 seconds)
Jul 31 15:56:56 pve5 corosync[2752]:   [TOTEM ] Token has not been received in 4200 ms
Jul 31 15:56:57 pve5 corosync[2752]:   [TOTEM ] A processor failed, forming new configuration: token timed out (5600ms), waiting 6720ms for consensus.
--- reboot ---

Mainly for migration of VMs between the 2 sites (yes, one 6-node cluster across 2 locations); after careful thought on this, there is no need for HA. The storage is on Ceph in stretch mode, which seems to do the trick to keep both sites' storage available.
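
For anyone following along, a very rough sketch of what enabling upstream Ceph's stretch mode involves; the monitor names, datacenter names and CRUSH rule are placeholders, and it also requires a tiebreaker monitor at a third location plus a matching CRUSH rule, so verify against the current Ceph documentation:

Bash:
# Monitors elect by connectivity instead of rank
ceph mon set election_strategy connectivity

# Tag each monitor with its site; the tiebreaker sits at a third location
ceph mon set_location pve1 datacenter=site-a
ceph mon set_location pve4 datacenter=site-b
ceph mon set_location tiebreaker datacenter=site-c

# Enable stretch mode with the tiebreaker monitor and a CRUSH rule
# that places replicas in both datacenters
ceph mon enable_stretch_mode tiebreaker stretch_rule datacenter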
 
Yes, it is a 2-fiber path, with MLAG to 2 switches.

Ok, looking at the log below, let me guess: both of the corosync links are actually the same physical network path?

Bash:
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] rx: host: 6 link: 0 is up
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 1 is down
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 has no active links
Jul 31 15:56:41 pve5 corosync[2752]:   [TOTEM ] Token has not been received in 4200 ms
Jul 31 15:56:48 pve5 watchdog-mux[2218]: client watchdog expired - disable watchdog updates

I am afraid the HA setup actually helped you discover that the extra link ("ring") set up like this is not really redundant, ironically due to the MLAG.

4s+ is a really long time. This could be brought down with BFD to make that MLAG actually useful in terms of corosync traffic.

Mainly for migration of VMs between the 2 sites (yes, one 6-node cluster across 2 locations); after careful thought on this, there is no need for HA.

Also a solution. ;)
 
Ok, looking at the log below, let me guess: both of the corosync links are actually the same physical network path?
They are physically separated at the node level but converge onto the same ring path (MLAG).

I am afraid the HA setup actually helped you discover that the extra link ("ring") set up like this is not really redundant, ironically due to the MLAG.

4s+ is a really long time. This could be brought down with BFD to make that MLAG actually useful in terms of corosync traffic.
Thanks again for highlighting the issues. Adding BFD sounds like a great approach; I shall reproduce this in a lab.

So a fully redundant path will require 4 fibers, making 2 separate MLAGs: 1 path for the primary, the other for the failover.
 
They are physically separated at the node level but converge onto the same ring path (MLAG).

Yes, it basically only protects against individual NIC failure.

Thanks again for highlighting the issues. Adding BFD sounds like a great approach; I shall reproduce this in a lab.

Definitely would need to test it out. Don't take my word for it. :)

So a fully redundant path will require 4 fibers, making 2 separate MLAGs: 1 path for the primary, the other for the failover.

Or, counterintuitively (and hypothetically), not to use MLAG at all. If you can simulate the same type of failure and actually check the failover time, that would help confirm it. Of course you may not want to have such a connection without MLAG, so yes, in that case 2 separate ones.
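
To illustrate the two-separate-paths idea, a minimal corosync.conf sketch with two knet links per node on genuinely independent networks; the 10.10.x.y addresses and priorities are made up for the example, and on PVE you would edit this via /etc/pve/corosync.conf so it replicates:

Bash:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.1    # link 0: first physical path
    ring1_addr: 10.10.2.1    # link 1: second, physically independent path
  }
  # ... one entry per node ...
}

totem {
  interface {
    linknumber: 0
    knet_link_priority: 10   # higher value = preferred link in passive mode
  }
  interface {
    linknumber: 1
    knet_link_priority: 5    # failover link
  }
}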
 
