6 node HA cluster split brain

ernestneo

The entire cluster spontaneously reboots when the link between the two sites is disrupted.

Bash:
Cluster information
-------------------
Name:             HA1
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Aug  1 10:47:46 2024
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.11f
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      5
Quorum:           4 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.1.1 (local)
0x00000003          1 172.16.1.3
0x00000004          1 172.16.1.4
0x00000005          1 172.16.1.5
0x00000006          1 172.16.1.6

[3 nodes] < --x-- > [ 2 nodes , 1 node down]


Will pvecm expected 2 help in this scenario, while keeping both halves of the split available?
 
The entire cluster spontaneously reboots when the link between the two sites is disrupted.

That's expected behavior. Each site has only 3 of the 6 expected votes, which is not "more than 50%": with 6 expected votes the quorum is 4, so neither side of the split is quorate.

Note that it triggers a reboot of a node only if High Availability is in use on that specific node.

Will pvecm expected 2 help in this scenario, while keeping both halves of the split available?

This is a setting that should really only be used for disaster recovery - which may be the mode you were in this morning.

Lying to the system that "expected 2" is fine lets both halves of the cluster do whatever they want, with a good chance of creating bad situations like starting the same VM on both sides.

So, yes: "expected 2" (or 3) will help you get going again. But be really careful...!
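
For reference, a minimal sketch of how that looks on the CLI, assuming you run it on exactly one side of the split - the side you want to keep serving (the vote count is the only thing you'd adjust):

Bash:
# DANGER: disaster recovery only - run this on ONE side of the split only,
# otherwise both halves become quorate and you risk starting the same VM twice.
# Lower the expected votes so the surviving 3-node partition is quorate again:
pvecm expected 3

# Check that "Quorate: Yes" comes back:
pvecm status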

As @spirit already said: only a third site acting as a neutral witness, with a connection path independent of the direct link between the two main sites, will help avoid this drama.
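
For completeness, a rough sketch of adding such a witness as a QDevice; the 172.16.1.7 address is just a placeholder for the third-site host, see the pvecm documentation for the exact procedure:

Bash:
# On the third-site witness (a small Debian box or VM is enough):
apt install corosync-qnetd

# On every cluster node:
apt install corosync-qdevice

# On one cluster node - registers the witness, adding a tie-breaking vote:
pvecm qdevice setup 172.16.1.7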
 
Note that it triggers a reboot of a node only if High Availability is in use on that specific node.

Thank you, that's clearer on the behaviour now.
So keeping the nodes clustered but putting no VMs under HA will prevent the cluster from killing itself should the link break happen?
 
So keeping the nodes clustered but putting no VMs under HA will prevent the cluster from killing itself should the link break happen?
Yes. Look at "Datacenter > HA > Status", Status column. If the nodes are no longer "active", they shouldn't reboot automatically anymore.
This happens 10 minutes after the last HA resource for a VM is deleted or set to "ignored".

As the others said: if you want to use HA, make sure there is one side left with more than 50% of the votes.
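
The same can be checked and changed from the CLI; a short sketch, with vm:100 as a placeholder resource ID:

Bash:
# Show all HA resources and the LRM state ("active"/"idle") of each node
ha-manager status

# Either remove a VM from HA management completely ...
ha-manager remove vm:100

# ... or keep the resource entry but have HA ignore it
ha-manager set vm:100 --state ignored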
 
Yes. Look at "Datacenter > HA > Status", Status column. If the nodes are no longer "active", they shouldn't reboot automatically anymore.
This happens 10 minutes after the last HA resource for a VM is deleted or set to "ignored".

As the others said: if you want to use HA, make sure there is one side left with more than 50% of the votes.

Yup got it now, thanks again.
 
[3 nodes @ Site A ] [switch] ---- 100G ---- [switch] [ 3 nodes @ Site B]

So I suppose this is an OM3/OM4 link.


Yup got it now, thanks again.

Just be aware that, due to a long-standing bug [1], the "master" will still reboot one last time upon losing quorum, even after you have disabled all HA. You can reboot it manually beforehand to avoid that, if that makes more sense to you.

[1] https://bugzilla.proxmox.com/show_bug.cgi?id=5243
 
So I suppose this is an OM3/OM4 link.
The link is OS2, a campus cross-connect, 2 redundant paths, LACP.

However, one of the fibers had an issue (not a hard down), causing huge packet loss, and the drama happened.
Just be aware that, due to a long-standing bug [1], the "master" will still reboot one last time upon losing quorum, even after you have disabled all HA. You can reboot it manually beforehand to avoid that, if that makes more sense to you.
Ah yes, I read about that bug - thanks for filing it and for the heads-up.
 
The link is OS2, a campus cross-connect, 2 redundant paths, LACP.

However, one of the fibers had an issue (not a hard down), causing huge packet loss, and the drama happened.

Do you use at least two "rings" for corosync, or just LACP with a single one? Is that one switch, or more with MLAG? If the latter, I would wonder what was in the logs during the outage.

What's the reason you clustered it 3+3? Do you actually need HA across the two sites? Thinking of possible failure scenarios (e.g. a switch dying), it's actually not more available; in fact, it is less available than with HA off, because with HA off you only lose half of the resources (those of one site) if e.g. the link goes down, whereas with HA on you lose everything on both sites.

If you only need to migrate across the two sites (i.e. that is your only reason to have it as one 6-node cluster), you could make use of qm remote-migrate, PBS, or manual ZFS send/receive, for instance. Alternatively, you could add at least a QDevice to remain quorate on at least one site.
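
For reference, a rough sketch of a qm remote-migrate call (still marked experimental; the VMIDs, host, API token, fingerprint, bridge and storage below are placeholders):

Bash:
# Migrate VM 100 to VMID 100 on a node of the other cluster.
# Endpoint, token, fingerprint, bridge and storage are placeholders.
qm remote-migrate 100 100 \
  'host=192.0.2.10,apitoken=PVEAPIToken=root@pam!migrate=<secret-uuid>,fingerprint=<target-cert-fingerprint>' \
  --target-bridge vmbr0 --target-storage local-zfs --online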
 
Do you use at least two "rings" for corosync, or just LACP with a single one? Is that one switch, or more with MLAG? If the latter, I would wonder what was in the logs during the outage.

What's the reason you clustered it 3+3? Do you actually need HA across the two sites? Thinking of possible failure scenarios (e.g. a switch dying), it's actually not more available; in fact, it is less available than with HA off, because with HA off you only lose half of the resources (those of one site) if e.g. the link goes down, whereas with HA on you lose everything on both sites.

If you only need to migrate across the two sites (i.e. that is your only reason to have it as one 6-node cluster), you could make use of qm remote-migrate, PBS, or manual ZFS send/receive, for instance. Alternatively, you could add at least a QDevice to remain quorate on at least one site.


Yes, it is a 2-fiber path, with MLAG to 2 switches.


Bash:
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] rx: host: 6 link: 0 is up
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 1 is down
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 has no active links
Jul 31 15:56:41 pve5 corosync[2752]:   [TOTEM ] Token has not been received in 4200 ms
Jul 31 15:56:48 pve5 watchdog-mux[2218]: client watchdog expired - disable watchdog updates
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] Sync members[3]: 1 4 5
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] Sync left[2]: 3 6
Jul 31 15:56:49 pve5 corosync[2752]:   [TOTEM ] A new membership (1.105) was formed. Members left: 3 6
Jul 31 15:56:49 pve5 corosync[2752]:   [TOTEM ] Failed to receive the leave message. failed: 3 6
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: members: 1/3756085, 4/4010098, 5/1261462
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: starting data syncronisation
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: members: 1/3756085, 4/4010098, 5/1261462
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: starting data syncronisation
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 31 15:56:49 pve5 corosync[2752]:   [QUORUM] Members[3]: 1 4 5
Jul 31 15:56:49 pve5 corosync[2752]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: node lost quorum
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: received sync request (epoch 1/3756085/0000000C)
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: received sync request (epoch 1/3756085/0000000C)
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: received all states
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: leader is 1/3756085
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: synced members: 1/3756085, 4/4010098, 5/1261462
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: all data is up to date
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] notice: dfsm_deliver_queue: queue length 18
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] crit: received write while not quorate - trigger resync
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [dcdb] crit: leaving CPG group
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: received all states
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: all data is up to date
Jul 31 15:56:49 pve5 pmxcfs[1261462]: [status] notice: dfsm_deliver_queue: queue length 219
Jul 31 15:56:50 pve5 pmxcfs[1261462]: [dcdb] notice: start cluster connection
Jul 31 15:56:50 pve5 pmxcfs[1261462]: [dcdb] crit: cpg_join failed: 14
Jul 31 15:56:50 pve5 pmxcfs[1261462]: [dcdb] crit: can't initialize service
Jul 31 15:56:50 pve5 pve-ha-crm[1262518]: loop take too long (64 seconds)
Jul 31 15:56:50 pve5 pvescheduler[3484246]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 31 15:56:50 pve5 pve-ha-crm[1262518]: status change slave => wait_for_quorum
Jul 31 15:56:50 pve5 pve-ha-lrm[1262029]: lost lock 'ha_agent_pve5_lock - cfs lock update failed - Device or resource busy
Jul 31 15:56:50 pve5 pve-ha-lrm[1262029]: status change active => lost_agent_lock
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] link: host: 1 link: 0 is down
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] link: host: 1 link: 1 is down
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 has no active links
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jul 31 15:56:54 pve5 corosync[2752]:   [KNET  ] host: host: 1 has no active links
Jul 31 15:56:55 pve5 pve-ha-lrm[1262029]: loop take too long (58 seconds)
Jul 31 15:56:56 pve5 corosync[2752]:   [TOTEM ] Token has not been received in 4200 ms
Jul 31 15:56:57 pve5 corosync[2752]:   [TOTEM ] A processor failed, forming new configuration: token timed out (5600ms), waiting 6720ms for consensus.
--- reboot ---

Mainly for migration of VMs between the 2 sites (yes, one 6-node cluster across 2 locations); after careful thought on this, there is no need for HA. The storage is on Ceph in stretch mode, which seems to do the trick to keep both sites' storage available.
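
For anyone following along, a very rough sketch of what enabling upstream Ceph's stretch mode involves; the monitor names, datacenter names and CRUSH rule are placeholders, and it also requires a tiebreaker monitor at a third location plus a matching CRUSH rule, so verify against the current Ceph documentation:

Bash:
# Monitors elect by connectivity instead of rank
ceph mon set election_strategy connectivity

# Tag each monitor with its site; the tiebreaker sits at a third location
ceph mon set_location pve1 datacenter=site-a
ceph mon set_location pve4 datacenter=site-b
ceph mon set_location tiebreaker datacenter=site-c

# Enable stretch mode with the tiebreaker monitor and a CRUSH rule
# that places replicas in both datacenters
ceph mon enable_stretch_mode tiebreaker stretch_rule datacenter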
 
Yes, it is a 2-fiber path, with MLAG to 2 switches.

Ok, looking at the log below, let me guess: both of the corosync links are actually the same physical network path?

Bash:
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:32 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 1 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] rx: host: 6 link: 0 is up
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] link: Resetting MTU for link 0 because host 6 joined
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:35 pve5 corosync[2752]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 1 is down
Jul 31 15:56:39 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] link: host: 6 link: 0 is down
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Jul 31 15:56:40 pve5 corosync[2752]:   [KNET  ] host: host: 6 has no active links
Jul 31 15:56:41 pve5 corosync[2752]:   [TOTEM ] Token has not been received in 4200 ms
Jul 31 15:56:48 pve5 watchdog-mux[2218]: client watchdog expired - disable watchdog updates

I am afraid the HA setup actually helped you discover that the extra link ("ring") set up like this is not really redundant, ironically due to the MLAG.

4s+ is a really long time. This could be brought down with BFD to make that MLAG actually useful in terms of corosync traffic.

Mainly for migration of VMs between the 2 sites (yes, one 6-node cluster across 2 locations); after careful thought on this, there is no need for HA.

Also a solution. ;)
 
Ok, looking at the log below, let me guess: both of the corosync links are actually the same physical network path?
They are physically separated at the node level but converge onto the same ring path (MLAG).

I am afraid the HA setup actually helped you discover that the extra link ("ring") set up like this is not really redundant, ironically due to the MLAG.

4s+ is a really long time. This could be brought down with BFD to make that MLAG actually useful in terms of corosync traffic.
Thanks again for highlighting the issues. Adding BFD sounds like a great approach; I shall reproduce this in a lab.

So a fully redundant path will require 4 fibers, making 2 separate MLAGs: 1 path for the primary, the other for the failover.
 
They are physically separated at the node level but converge onto the same ring path (MLAG).

Yes, it basically only protects against individual NIC failure.

Thanks again for highlighting the issues. Adding BFD sounds like a great approach; I shall reproduce this in a lab.

Definitely would need to test it out. Don't take my word for it. :)

So a fully redundant path will require 4 fibers, making 2 separate MLAGs: 1 path for the primary, the other for the failover.

Or, counterintuitively (and hypothetically), not to use MLAG at all. If you can simulate the same type of failure and actually check the failover time, that would help confirm it. Of course you may not want to have such a connection without MLAG, so yes, in that case 2 separate ones.
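
To illustrate the two-separate-paths idea, a minimal corosync.conf sketch with two knet links per node on genuinely independent networks; the 10.10.x.y addresses and priorities are made up for the example, and on PVE you would edit this via /etc/pve/corosync.conf so it replicates:

Bash:
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.1.1    # link 0: first physical path
    ring1_addr: 10.10.2.1    # link 1: second, physically independent path
  }
  # ... one entry per node ...
}

totem {
  interface {
    linknumber: 0
    knet_link_priority: 10   # higher value = preferred link in passive mode
  }
  interface {
    linknumber: 1
    knet_link_priority: 5    # failover link
  }
}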
 
