I've had a fairly stable cluster running for quite a while. I needed to add two additional nodes. After adding the first of the two, I suspect the cluster entered a "split brain" situation: it started hiccuping, and at this point the GUI is unresponsive on most nodes. Basically, upon entering the GUI, all the machines appear to be offline with a small red X, although they are in fact online and VMs continue to work as expected.
I have verified that /etc/pve/corosync.conf is the same on all nodes. Of course, /etc/pve is read-only on all nodes at this point, since no node has quorum.
pvecm status on all nodes shows the following (the only difference being the last line, which shows each machine's own IP address).
Code:
root@ceph-6:~# pvecm status
Cluster information
-------------------
Name: GRAVITY
Config Version: 45
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Sun Apr 13 10:16:08 2025
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 2.190d3
Quorate: No
Votequorum information
----------------------
Expected votes: 10
Highest expected: 10
Total votes: 1
Quorum: 6 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 192.168.228.21 (local)
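If it helps with diagnosis, I can also collect the per-link knet view from each node; I would plan to run something like this on every host (commands only for now, I haven't pasted the output):
Code:
# Local knet link status for this node (shows per-link connected state)
corosync-cfgtool -s
# Quorum view straight from corosync, for comparison with pvecm status
corosync-quorumtool -s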
Two networks are available to corosync: a 40G network and a 1G Ethernet network. I have verified that every node can reach every other node on both subnets, and ping times are very consistent, measuring below 0.1 ms. I have checked my switch logs for abnormal entries and found nothing. I have also verified that all machines use the same MTU (9000 on the 40G network, 1500 on the 1G). Nothing in the corosync configuration looks unusual or broken:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: ceph-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.228.25
    ring1_addr: 172.16.228.25
  }
  node {
    name: ceph-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.228.26
    ring1_addr: 172.16.228.26
  }
  node {
    name: ceph-3
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.228.32
    ring1_addr: 172.16.228.32
  }
  node {
    name: ceph-4
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.228.34
    ring1_addr: 172.16.228.34
  }
  node {
    name: ceph-5
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.228.28
    ring1_addr: 172.16.228.28
  }
  node {
    name: ceph-6
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.228.21
    ring1_addr: 172.16.228.21
  }
  node {
    name: ceph-7
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.228.35
    ring1_addr: 172.16.228.35
  }
  node {
    name: ceph-8
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.228.36
    ring1_addr: 172.16.228.36
  }
  node {
    name: ceph-9
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.228.37
    ring1_addr: 172.16.228.37
  }
  node {
    name: ceph-10
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 192.168.228.38
    ring1_addr: 172.16.228.38
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: GRAVITY
  config_version: 45
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
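One thing worth mentioning: corosync itself reads /etc/corosync/corosync.conf rather than the pmxcfs copy, so on each node I can also confirm the two files match with a quick check like this (standard Proxmox paths, nothing exotic):
Code:
# Compare the config corosync actually loads against the pmxcfs-managed copy;
# no output means they are identical on this node
diff /etc/corosync/corosync.conf /etc/pve/corosync.conf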
All my nodes (per the pvecm status output) are currently talking on 192.168.228.x, which is my 1G network with dedicated switches carrying nothing but corosync traffic. The 172.16 network is my "public" network.
I enabled syslogging on one of my nodes, and I am getting spammed with messages like the following:
Code:
2025-04-13T10:27:52.659814-04:00 ceph-3 corosync[1969]: [KNET ] loopback: send local failed. error=Resource temporarily unavailable
These messages completely spam the log and will fill the disk in no time, so I can't leave logging enabled for long without continually truncating the file. It also makes it extremely difficult to extract other meaningful diagnostic data, since these entries drown out everything else.
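In the meantime, to pull useful entries out without leaving syslog enabled, my plan is to just filter the journal, roughly like this (the grep pattern is simply the spammy loopback line above):
Code:
# Corosync messages from the current boot, with the loopback spam filtered out
journalctl -b -u corosync --no-pager | grep -v "loopback: send local failed"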
I also periodically see messages like:
Code:
corosync[1969]: [KNET ] link: host: 8 link: 0 is down
corosync[1969]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[1969]: [KNET ] host: host: 8 has no active links
corosync[1969]: [KNET ] link: host: 2 link: 0 is down
corosync[1969]: [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
corosync[1969]: [KNET ] link: host: 5 link: 0 is down
corosync[1969]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
corosync[1969]: [KNET ] host: host: 5 has no active links
corosync[1969]: [KNET ] link: host: 7 link: 1 is down
corosync[1969]: [KNET ] link: host: 5 link: 0 is down
corosync[1969]: [KNET ] host: host: 7 (passive) best link: 1 (pri: 1)
corosync[1969]: [KNET ] host: host: 7 has no active links
corosync[1969]: [KNET ] host: host: 5 (passive) best link: 1 (pri: 1)
corosync[1969]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
However, I see no indication on the switch that the links are actually bouncing, and if I sit on a machine I can ping every other host continuously with no packet loss and no significant variation in round-trip times. Here is what I typically see from ping output:
Code:
rtt min/avg/max/mdev = 0.048/0.058/0.071/0.009 ms
I'm also not sure why I get the MTU reset messages, since I have confirmed the MTU is the same on every host.
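If it would help rule out a path-MTU problem, I can also run don't-fragment pings at full payload size on both networks; something along these lines (the addresses are just examples from my node list, and I'm assuming the 172.16 side is the 9000-MTU 40G network):
Code:
# 1G corosync network, 1500 MTU -> 1472-byte ICMP payload with DF set
ping -M do -s 1472 -c 5 192.168.228.25
# 40G network, 9000 MTU -> 8972-byte ICMP payload with DF set
ping -M do -s 8972 -c 5 172.16.228.25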
So from what I can tell, it doesn't appear that I have a physical network issue on either corosync network (and I would find it odd for both physical switches to develop problems right after adding the 10th node), yet for some reason corosync is barfing.
I've tried physically rebooting all the machines simultaneously. This didn't fix the issue.
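Short of another round of full reboots, I assume restarting just the cluster services on a node would be a lighter-weight test; something like:
Code:
# Restart the corosync membership layer and the pmxcfs cluster filesystem on one node
systemctl restart corosync pve-cluster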
I'm sort of out of ideas at this point as to what to try next, except perhaps running "pvecm expected 1" on the nodes, or changing the votes for one of the nodes from 1 to 2 so that there would in fact be a quorum.
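For reference, this is roughly what I have in mind for that last-resort step (I haven't run it yet):
Code:
# Temporarily lower the expected vote count so this node regains quorum on its own
pvecm expected 1
# Then re-check quorum state and whether /etc/pve becomes writable again
pvecm status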