Assistance recovering from "split brain" / corosync issues.

Stunty

Apr 13, 2025
I've had a fairly stable cluster running for quite a while. I needed to add two additional nodes. After adding the first of them, I suspect the cluster entered a "split brain" situation, because it started hiccuping, and at this point the GUI is unresponsive on most nodes. Upon entering the GUI, all of the machines appear to be offline with a small red X, although they are in fact online and the VMs continue to work as expected.

I have verified that /etc/pve/corosync.conf is the same on all nodes. Of course, /etc/pve is read only on all nodes at this point.
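(For reference, one way to double-check that from a single shell is to checksum the file on every node over SSH. This is just a rough sketch, assuming root SSH between the nodes works, which PVE normally sets up, and using the hostnames from the nodelist below.)

Code:
for h in ceph-1 ceph-2 ceph-3 ceph-4 ceph-5 ceph-6 ceph-7 ceph-8 ceph-9 ceph-10; do
  # compare both the pmxcfs copy and the local corosync copy on each node
  ssh root@$h 'hostname; sha256sum /etc/pve/corosync.conf /etc/corosync/corosync.conf'
done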

pvecm status on all nodes shows the following (with the only difference being the last line, with the machine's IP address).

Code:
root@ceph-6:~# pvecm status
Cluster information
-------------------
Name:             GRAVITY
Config Version:   45
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Apr 13 10:16:08 2025
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.190d3
Quorate:          No

Votequorum information
----------------------
Expected votes:   10
Highest expected: 10
Total votes:      1
Quorum:           6 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 192.168.228.21 (local)

Two networks are available to corosync: a 40G network and a 1G Ethernet network. I have verified that every node can reach every other node on both subnets. Ping times are very consistent, measuring below 0.1 ms. I have checked my switch logs for any abnormal entries; nothing is present. I have verified that all machines are using the same MTU (40G = 9000, 1G = 1500). Nothing in the corosync config looks out of the ordinary or broken:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: ceph-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.228.25
    ring1_addr: 172.16.228.25
  }
  node {
    name: ceph-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.228.26
    ring1_addr: 172.16.228.26
  }
  node {
    name: ceph-3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.228.32
    ring1_addr: 172.16.228.32
  }
  node {
    name: ceph-4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.228.34
    ring1_addr: 172.16.228.34
  }
  node {
    name: ceph-5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.228.28
    ring1_addr: 172.16.228.28
  }
  node {
    name: ceph-6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.228.21
    ring1_addr: 172.16.228.21
  }
  node {
    name: ceph-7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.228.35
    ring1_addr: 172.16.228.35
  }
  node {
    name: ceph-8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.228.36
    ring1_addr: 172.16.228.36
  }
  node {
    name: ceph-9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.228.37
    ring1_addr: 172.16.228.37
  }
  node {
    name: ceph-10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.228.38
    ring1_addr: 172.16.228.38
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: GRAVITY
  config_version: 45
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
}

All my nodes (from the pvecm status output) are talking currently on 192.168.228.x, which is my 1G network, with dedicated switches for nothing but corosync traffic. The 172.16 is my "public" network.

I enabled syslogging on one of my nodes, and it seems I am getting spammed with messages which appear as follows:

Code:
2025-04-13T10:27:52.659814-04:00 ceph-3 corosync[1969]:   [KNET  ] loopback: send local failed. error=Resource temporarily unavailable

These messages completely spam the log and will fill up the disk in no time, so I can't leave logging enabled for long without continually truncating it. It also makes it extremely difficult to get any other meaningful diagnostic data, since there are so many of these lines drowning everything else out.
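(One way to read around the spam without leaving full logging on indefinitely is to pull the corosync unit's journal and filter the noisy pattern out; a rough sketch:)

Code:
# show the last 10 minutes of corosync messages, minus the loopback spam
journalctl -u corosync --since "10 minutes ago" | grep -v 'loopback: send local failed'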

I also periodically will see messages like:

Code:
corosync[1969]:   [KNET  ] link: host: 8 link: 0 is down
corosync[1969]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[1969]:   [KNET  ] host: host: 8 has no active links
corosync[1969]:   [KNET  ] link: host: 2 link: 0 is down
corosync[1969]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
corosync[1969]:   [KNET  ] link: host: 5 link: 0 is down
corosync[1969]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
corosync[1969]:   [KNET  ] host: host: 5 has no active links
corosync[1969]:   [KNET  ] link: host: 7 link: 1 is down
corosync[1969]:   [KNET  ] link: host: 5 link: 0 is down
corosync[1969]:   [KNET  ] host: host: 7 (passive) best link: 1 (pri: 1)
corosync[1969]:   [KNET  ] host: host: 7 has no active links
corosync[1969]:   [KNET  ] host: host: 5 (passive) best link: 1 (pri: 1)
corosync[1969]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined

However, I see no indication on the switch that the links are in fact bouncing up and down, and if I sit on a machine I can continuously ping every other host on the network with no packet loss and no significant variation in ping times. I'm also not sure why I get the MTU reset message; I have confirmed the MTU is the same on every host. Here is what I typically see from ping output:

Code:
rtt min/avg/max/mdev = 0.048/0.058/0.071/0.009 ms
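(As an aside, a plain ping only proves that small packets get through. To confirm the jumbo-frame path end to end, a don't-fragment ping at the largest payload the MTU allows can be used. A sketch, using addresses from the nodelist above and assuming the 172.16.228.x side is the 9000-MTU 40G network:)

Code:
# 9000-byte MTU leaves room for 8972 bytes of ICMP payload (20 IP + 8 ICMP header)
ping -M do -s 8972 -c 5 172.16.228.25
# 1500-byte MTU on the 1G corosync network -> 1472-byte payload
ping -M do -s 1472 -c 5 192.168.228.25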

So from what I can tell, it doesn't appear that I have a physical network issue on either corosync network (and I would find it odd that both physical switches developed problems right after adding the 10th node), yet for some reason corosync is barfing.
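(For reference, corosync's own view of the link state can be checked independently of the switch and of ping, which makes for a useful comparison against the log messages above; a sketch, assuming the tools that ship with corosync 3 on PVE:)

Code:
corosync-cfgtool -s   # status of this node's knet links/rings
corosync-cfgtool -n   # members corosync knows about and which links it thinks are up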

I've tried physically rebooting all the machines simultaneously. This didn't fix the issue.

I'm sort of out of ideas at this point as to what to try next, except perhaps running "pvecm expected 1" on the nodes, or changing the votes for one of the nodes from 1 to 2 so that there would in fact be a quorum.
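(For reference, the first of those is a one-liner; a sketch below. It only lowers the runtime expected-vote count on the node it is run on and is meant as a temporary escape hatch, while editing quorum_votes by hand would also require bumping config_version and restarting corosync on every node.)

Code:
# temporarily let this node be quorate on its own so /etc/pve becomes writable again
pvecm expected 1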
 
If they all show one vote (?) then it seems they’re not talking at all…

Have you tried powering off/disconnecting the new node? And then maybe the switch?
 
The node, yes; the switch, no, since there's more attached to it than just the cluster equipment.

I do not know why corosync can't talk to the other nodes when I can do so just fine from a Linux shell.
 
PVE Firewall? I don't recall offhand whether it blocks corosync, but it does block Ceph.
Firewall is disabled.

Keep in mind this was a functional cluster until the addition of the 10th node, and hiccupped before I could add the 11th.
 
I ran into this issue a lot whenever we used clustering. A reliable way I found to fix it:

- Run this command on all nodes in the cluster at the **same exact time**: killall -9 corosync; service corosync restart
- I'd recommend using an SSH client like iTerm that lets you broadcast input to all tabs at once. Or, schedule it to run at a certain time (see the sketch after this list).
- After that, all nodes should be back in one corosync cluster. You can verify by running 'pvecm status'; you should see all the nodes.
- Then, run service pve-cluster restart on each node, one at a time. You may need to restart other PVE services that could be hung as well.
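If a broadcast-capable terminal isn't handy, scheduling the restart works too; a rough sketch using at (assumes the node clocks are NTP-synced and the at package is installed; the time is just a placeholder):

Code:
# queue the exact same restart for the same wall-clock minute on every node
echo 'killall -9 corosync; service corosync restart' | at 14:30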
 
Split-brain is resolved.

Had to shut down the cluster services on all nodes, create a new corosync.conf with an odd vote count, copy it to all nodes (scp -p to preserve the timestamps), and then restart all nodes simultaneously. Thanks go to _--James--_ on Reddit for the assist.
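For anyone hitting this later, that is roughly the shape of the documented procedure for replacing corosync.conf when the cluster has no quorum; a sketch (the new config filename is just a placeholder):

Code:
# on every node: stop the cluster stack
systemctl stop pve-cluster corosync
# start pmxcfs in local mode so /etc/pve is writable without quorum
pmxcfs -l
# drop the corrected config into both places
cp /root/corosync.conf.new /etc/pve/corosync.conf
cp /root/corosync.conf.new /etc/corosync/corosync.conf
# stop the local-mode pmxcfs and bring everything back up
killall pmxcfs
systemctl start corosync pve-cluster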

dw-cj: your suggestion, I think, is more or less the same thing... assuming the corosync.conf file on each node is identical.
 
Yep, the only way I was able to fix it was starting the service at the same time on all of them. I ended up just moving away from Proxmox clustering. It seems to break down pretty badly and cause constant annoyances like this once you get 15+ nodes in a cluster, at least for us.
 
Overall it has worked well for me, but the whole corosync thing seems sort of kludgey. I'm still not sure exactly what happened to make it vomit on my cluster, and the fact that it spams the shit out of /var/log/syslog makes it incredibly difficult to debug -- not that the messages it logs are particularly informative. The messages I was receiving (above), with links coming up and down, seemed to imply a network issue, but I had no network issues (I even enabled debugging on my switch to get verbose messages, but saw nothing abnormal).
 
One small network issue can cause the whole thing to go haywire. It will get worse with more nodes too.