Cluster trouble for weeks now

abarakat

Member
Oct 19, 2022
Cluster issues seem to have been around for years, and I am not sure what to do anymore; I have tried all kinds of things. Corosync runs on a dedicated, very low-latency network. This is a 14-node cluster with over 700 VMs/LXCs, and the problems started after growing the cluster past 600+ VMs/LXCs. I have one node that is apparently not voting even though it is up and reachable, but it is crazy if just one node can bring everything down. That non-voting node also says activity blocked. Latest pvecm status and related output below:

root@nypve01:/var/log# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-9
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.116-1-pve: 5.15.116-1
pve-kernel-5.15.111-1-pve: 5.15.111-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2


root@nypve01:/var/log# more /etc/corosync/corosync.conf
logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: nypve01
nodeid: 1
quorum_votes: 1
ring0_addr: 10.64.0.11
ring1_addr: 10.82.97.65
}
node {
name: nypve02
nodeid: 2
quorum_votes: 1
ring0_addr: 10.64.0.12
ring1_addr: 10.82.97.66
}
node {
name: nypve03
nodeid: 3
quorum_votes: 1
ring0_addr: 10.64.0.13
ring1_addr: 10.82.97.67
}
node {
name: nypve04
nodeid: 4
quorum_votes: 1
ring0_addr: 10.64.0.14
ring1_addr: 10.82.97.68
}
node {
name: nypve05
nodeid: 9
quorum_votes: 1
ring0_addr: 10.64.0.15
ring1_addr: 10.82.97.69
}
node {
name: nypve06
nodeid: 10
quorum_votes: 1
ring0_addr: 10.64.0.16
ring1_addr: 10.82.97.70
}
node {
name: nypve07
nodeid: 11
quorum_votes: 1
ring0_addr: 10.64.0.17
ring1_addr: 10.82.97.71
}
node {
name: sfpve01
nodeid: 5
quorum_votes: 1
ring0_addr: 10.64.0.21
ring1_addr: 10.82.31.9
}
node {
name: sfpve02
nodeid: 6
quorum_votes: 1
ring0_addr: 10.64.0.22
ring1_addr: 10.82.31.10
}
node {
name: sfpve03
nodeid: 7
quorum_votes: 1
ring0_addr: 10.64.0.23
ring1_addr: 10.82.31.11
}
node {
name: sfpve04
nodeid: 8
quorum_votes: 1
ring0_addr: 10.64.0.24
ring1_addr: 10.82.31.12
}
node {
name: sfpve05
nodeid: 12
quorum_votes: 1
ring0_addr: 10.64.0.25
ring1_addr: 10.82.31.13
}
node {
name: sfpve06
nodeid: 13
quorum_votes: 1
ring0_addr: 10.64.0.26
ring1_addr: 10.82.31.14
}
node {
name: sfpve07
nodeid: 14
quorum_votes: 1
ring0_addr: 10.64.0.27
ring1_addr: 10.82.31.15
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: svl-b-labs
config_version: 16
interface {
linknumber: 0
}
ip_version: ipv4-6
link_mode: passive
secauth: on
version: 2
}


root@nypve01:/var/log# pvecm status
Cluster information
-------------------
Name: svl-b-labs
Config Version: 16
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed May 22 11:46:50 2024
Quorum provider: corosync_votequorum
Nodes: 13
Node ID: 0x00000001
Ring ID: 1.beff
Quorate: Yes

Votequorum information
----------------------
Expected votes: 14
Highest expected: 14
Total votes: 13
Quorum: 8
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.64.0.11 (local)
0x00000002 1 10.64.0.12
0x00000003 1 10.64.0.13
0x00000004 1 10.64.0.14
0x00000005 1 10.64.0.21
0x00000006 1 10.64.0.22
0x00000007 1 10.64.0.23
0x00000008 1 10.64.0.24
0x00000009 1 10.64.0.15
0x0000000a 1 10.64.0.16
0x0000000c 1 10.64.0.25
0x0000000d 1 10.64.0.26
0x0000000e 1 10.64.0.27

root@nypve01:/var/log# pvecm nodes

Membership information
----------------------
Nodeid Votes Name
1 1 nypve01 (local)
2 1 nypve02
3 1 nypve03
4 1 nypve04
5 1 sfpve01
6 1 sfpve02
7 1 sfpve03
8 1 sfpve04
9 1 nypve05
10 1 nypve06
12 1 sfpve05
13 1 sfpve06
14 1 sfpve07


root@nypve07:/var/log# pvecm status
Cluster information
-------------------
Name: svl-b-labs
Config Version: 16
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed May 22 11:36:03 2024
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x0000000b
Ring ID: 1.c23f
Quorate: No

Votequorum information
----------------------
Expected votes: 14
Highest expected: 14
Total votes: 1
Quorum: 8 Activity blocked
Flags:

Membership information
----------------------
Nodeid Votes Name
0x0000000b 1 10.64.0.17 (local)
 
Hello,

Have you measured the ping between all pairs of nodes in the cluster? This should amount to 91 measurements. The worst case should ideally be below 1 ms.

Have you verified whether all pairs of nodes can reach each other via both networks? You can verify this via the

Code:
corosync-cfgtool -n

command. Please ensure that all links report *both* `enabled` and `connected`.
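
For example, a rough sketch like the following, run on each node in turn, would give you all pairwise measurements over both links (the IP lists are copied from the corosync.conf you posted, so adjust them as needed; this is illustrative, not an exact recipe):

Code:
#!/bin/bash
# Rough sketch: run on *each* node to measure RTT to every other node
# over both corosync links. IP lists taken from the corosync.conf above.
LINK0="10.64.0.11 10.64.0.12 10.64.0.13 10.64.0.14 10.64.0.15 10.64.0.16 10.64.0.17 10.64.0.21 10.64.0.22 10.64.0.23 10.64.0.24 10.64.0.25 10.64.0.26 10.64.0.27"
LINK1="10.82.97.65 10.82.97.66 10.82.97.67 10.82.97.68 10.82.97.69 10.82.97.70 10.82.97.71 10.82.31.9 10.82.31.10 10.82.31.11 10.82.31.12 10.82.31.13 10.82.31.14 10.82.31.15"
for ip in $LINK0 $LINK1; do
    # 20 pings per target; keep only the min/avg/max/mdev summary line
    echo -n "$ip  "
    ping -q -c 20 -i 0.2 "$ip" | tail -n 1
done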
 
Hello,

Have you measured the ping between all pairs of nodes in the cluster? This should amount to 91 measurements. The worst case should ideally be below 1 ms.
Exactly this. I hope you named your nodes just for fun. But if "ny" stands for New York and "sf" for San Francisco AND that is the real location of your nodes, corosync will never be stable; as already mentioned, latency must be below 1 ms.
 
Hello,

Have you measured the ping between all pairs of nodes in the cluster? This should amount to 91 measurements. The worst case should ideally be below 1 ms.

Have you verified whether all pairs of nodes can reach each other via both networks? You can verify this via the

Code:
corosync-cfgtool -n

command. Please ensure that all links report *both* `enabled` and `connected`.

Ping response is well below 1 ms, see below. Not sure what you mean by 91 measurements.


root@nypve01:~# corosync-cfgtool -n
Local node ID 1, transport knet
nodeid: 2 reachable
LINK: 0 udp (10.64.0.11->10.64.0.12) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.97.66) enabled connected mtu: 1397

nodeid: 3 reachable
LINK: 0 udp (10.64.0.11->10.64.0.13) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.97.67) enabled connected mtu: 1397

nodeid: 4 reachable
LINK: 0 udp (10.64.0.11->10.64.0.14) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.97.68) enabled connected mtu: 1397

nodeid: 5 reachable
LINK: 0 udp (10.64.0.11->10.64.0.21) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.31.9) enabled connected mtu: 1397

nodeid: 6 reachable
LINK: 0 udp (10.64.0.11->10.64.0.22) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.31.10) enabled connected mtu: 1397

nodeid: 7 reachable
LINK: 0 udp (10.64.0.11->10.64.0.23) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.31.11) enabled connected mtu: 1397

nodeid: 8 reachable
LINK: 0 udp (10.64.0.11->10.64.0.24) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.31.12) enabled connected mtu: 1397

nodeid: 9 reachable
LINK: 0 udp (10.64.0.11->10.64.0.15) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.97.69) enabled connected mtu: 1397

nodeid: 10 reachable
LINK: 0 udp (10.64.0.11->10.64.0.16) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.97.70) enabled connected mtu: 1397

nodeid: 11 reachable
LINK: 0 udp (10.64.0.11->10.64.0.17) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.97.71) enabled connected mtu: 1397

nodeid: 12 reachable
LINK: 0 udp (10.64.0.11->10.64.0.25) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.31.13) enabled connected mtu: 1397

nodeid: 13 reachable
LINK: 0 udp (10.64.0.11->10.64.0.26) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.31.14) enabled connected mtu: 1397

nodeid: 14 reachable
LINK: 0 udp (10.64.0.11->10.64.0.27) enabled connected mtu: 1397
LINK: 1 udp (10.82.97.65->10.82.31.15) enabled connected mtu: 1397





root@nypve01:~# ping -f 10.64.0.12
PING 10.64.0.12 (10.64.0.12) 56(84) bytes of data.
.^
--- 10.64.0.12 ping statistics ---
76961 packets transmitted, 76961 received, 0% packet loss, time 2900ms
rtt min/avg/max/mdev = 0.026/0.029/0.429/0.004 ms, ipg/ewma 0.037/0.027 ms



root@nypve01:~# ping -f 10.64.0.27
PING 10.64.0.27 (10.64.0.27) 56(84) bytes of data.
^C
--- 10.64.0.27 ping statistics ---
83885 packets transmitted, 83885 received, 0% packet loss, time 3342ms
rtt min/avg/max/mdev = 0.028/0.032/0.251/0.004 ms, ipg/ewma 0.039/0.030 ms
root@nypve01:~#
 
Exactly this. I hope you named your nodes just for fun. But if "ny" stands for New York and "sf" for San Francisco AND that is the real location of your nodes, corosync will never be stable; as already mentioned, latency must be below 1 ms.
Yes, for fun; bad choice, I know. They are all in the same building, and corosync runs on the same network fabric.
 
Ping response is well below 1 ms, see below. Not sure what you mean by 91 measurements.

root@nypve01:~# ping -f 10.64.0.12
PING 10.64.0.12 (10.64.0.12) 56(84) bytes of data.
.^
--- 10.64.0.12 ping statistics ---
76961 packets transmitted, 76961 received, 0% packet loss, time 2900ms
rtt min/avg/max/mdev = 0.026/0.029/0.429/0.004 ms, ipg/ewma 0.037/0.027 ms



root@nypve01:~# ping -f 10.64.0.27
PING 10.64.0.27 (10.64.0.27) 56(84) bytes of data.
^C
--- 10.64.0.27 ping statistics ---
83885 packets transmitted, 83885 received, 0% packet loss, time 3342ms
rtt min/avg/max/mdev = 0.028/0.032/0.251/0.004 ms, ipg/ewma 0.039/0.030 ms
root@nypve01:~#

This only shows that the latency over link 0 between node 1 and node 2 (and node 14) was low enough during that very specific moment. As I said, you have to check all possible node combinations, and both links should be tested.
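
If I remember correctly, corosync 3 with knet also keeps per-link runtime statistics that you can read locally on each node, which can help spot latency spikes or link flaps over time. Treat this as a pointer rather than an exact recipe:

Code:
# per-node, per-link knet statistics from corosync's stats map
corosync-cmapctl -m stats | grep -E 'latency|down_count'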
 
This only shows that the latency over link 0 between node 1 and node 2 (and node 14) was low enough during that very specific moment. As I said, you have to check all possible node combinations, and both links should be tested.
Sure, I will do more testing. Is link 1 a backup link only, or does corosync use both links simultaneously?

Link 1 traverses other network fabrics and is there only as a temporary backup for the possibility of maintenance work on link 0.
 
This only shows that the latency over link 0 between node 1 and node 2 (and node 14) was low enough during that very specific moment. As I said, you have to check all possible node combinations, and both links should be tested.
I think I found the issue. Upon further ping flood testing on the corosync network, I noticed one host would behave intermittently and occasionally shoot up to a 5 ms response, driving the average just over 1 ms. I moved the cable to another switch port, and so far I have not been able to replicate the ping flood issue. Now the cluster seems stable; time will tell.
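
For reference, instead of a full flood ping, something like this can watch a single target and print only the slow replies with timestamps (illustrative only; the 1 ms threshold and the target address are arbitrary choices):

Code:
# print only echo replies slower than 1 ms, with kernel timestamps (-D);
# 10.64.0.17 is just an example target
ping -D -i 0.2 10.64.0.17 | awk -F'time=' '/time=/ { if ($2+0 > 1) print }'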

If this really was the issue, I am concerned that a single host can cause problems for the whole cluster. This seems to defeat the purpose of a distributed cluster system.

Thanks all for your insights.
 
If this really was the issue, I am concerned that a single host can cause problems for the whole cluster. This seems to defeat the purpose of a distributed cluster system.
It's not clear to me from the information in this post: what exactly are the problems with the whole cluster? The only problem I see is that one single node was out of quorum, while the rest of the cluster was still up and running, quorate, and providing services correctly. Just trying to understand the symptoms that you are seeing.
 
Not sure what you mean by 91 measurements.
That's the total number of (all possible) connections between the nodes.

The formula is (n*(n-1))/2, where n is the number of nodes, so in your case that would be (14*(14-1))/2 = 91.


If algebra is not your thing, look at it this way: each of the 14 nodes has to connect to all of the other nodes, except itself (maybe a mad node has to be tested for talking to itself!). So each node has 13 connections to make, but the connection from node1 to node2 is the same connection (we're testing) as from node2 to node1, so we won't count it again.

Therefore node1 has 13 connections, node2 has 12 connections (as one of them is already covered by node1's 13), node3 has only 11 connections (as one is already covered by node1's 13 and another by node2's 12), and so on. When you reach the last node, node14, it has already been completely covered.

So in fact the number is going to be 13+12+11+10+9+8+7+6+5+4+3+2+1=91
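
In shell, for example:

Code:
# same count done in shell: n*(n-1)/2 unique pairs for n = 14 nodes
n=14; echo $(( n * (n - 1) / 2 ))    # prints 91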
 
It's not clear to me from the information in this post: what exactly are the problems with the whole cluster? The only problem I see is that one single node was out of quorum, while the rest of the cluster was still up and running, quorate, and providing services correctly. Just trying to understand the symptoms that you are seeing.
The symptoms were all over the place. Most of the time I cannot log in via the web UI on most hosts. In some cases I can log in to one host and see that it is still in sync with some of the hosts, but the rest are all red and inaccessible. However, I can SSH to all of them and restart services to recover.
 
This only shows that the latency over link 0 between node 1 and node 2 (and node 14) was low enough during that very specific moment. As I said, you have to check all possible node combinations, and both links should be tested.
Looks like I lost the cluster again.

I think I would like to disable the secondary link; what is the best way to accomplish this?
 
In my experience that won't help at all. In a properly working cluster, corosync uses the primary link to send most of its communication, while the secondary link is only monitored for reachability, packet loss, and jitter. If some machine in the cluster is unreachable via link0, the cluster communication with that single server is sent to/from link1. Issues with link1 will not affect quorum or link0 communications in any way (as long as link0 is working, of course).

What's the output of corosync-cfgtool -n on every node of the cluster? What's the output of pvecm status on every node? Compare both outputs to see if they match for every node. Check with tcpdump whether you are really getting corosync traffic from every node on every node (pay attention to the MTU sizes of the packets).

This is a hard issue to solve on the forum; there are too many variables that may produce it.
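
For the tcpdump check, something along these lines should work; 5405 is corosync's default knet port, and the interface name here is just a placeholder:

Code:
# watch corosync/knet traffic on the corosync network interface
# (replace eth0 with the interface that carries link0)
tcpdump -ni eth0 udp port 5405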
 
