PVE 6.2 - Can't add 2 new nodes to cluster

Hello,
we have been using PVE since version 2, and after years of smooth operation I am now running into a real headache.
We run a 6.2 cluster with 13 nodes and want to add two more nodes.
After issuing pvecm add <existingnode> on the first new node, the whole cluster lost quorum; corosync no longer formed stable memberships and kept dropping the former members.
I had to interrupt the join operation and stop and restart the corosync service on all nodes. I then ran pvecm delnode <firstnewnode> and tried again with the second new node (the first one had to be reinstalled because of problems with pmxcfs).
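Roughly the sequence I went through, in case that matters (node names are placeholders, not the exact hostnames, and this is from memory rather than an exact transcript):

Code:
# on the new node: join the existing cluster
pvecm add <existingnode>

# the join hung without quorum, so on every cluster node:
systemctl restart corosync

# afterwards, on an existing node: drop the half-joined node again
pvecm delnode <firstnewnode>

# and check membership / quorum
pvecm status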

No luck so far.
I don't know what is happening here. Maybe a problem with the versions? The new nodes have a newer kernel and pve-manager than the existing cluster nodes (no further updates are available for those).

I have never seen such strange behaviour when adding new nodes. The network configuration and the hosts files follow the same schema on every host. Nothing special.
This has been giving me a headache for the last few hours. Any suggestions? Which information can I provide?
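Apart from the logs and versions below, I could also gather more output, for example (just a sketch of the commands I would run on both an old and a new node):

Code:
# package versions
pveversion -v

# cluster membership and quorum state
pvecm status

# corosync link status as seen by this node
corosync-cfgtool -s

# recent corosync / pmxcfs messages
journalctl -u corosync -u pve-cluster --since "1 hour ago"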

Here is some information:

<newnode>

Code:
Aug 28 16:57:58 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 70
Aug 28 16:57:58 srvhost12 systemd[1]: Started Session 32 of user root.
Aug 28 16:57:59 srvhost12 corosync[26513]:   [TOTEM ] A new membership (e.122d) was formed. Members left: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 16:57:59 srvhost12 corosync[26513]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 16:57:59 srvhost12 corosync[26513]:   [QUORUM] Members[1]: 14
Aug 28 16:57:59 srvhost12 corosync[26513]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 16:57:59 srvhost12 corosync[26513]:   [TOTEM ] A new membership (1.1231) was formed. Members joined: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 16:57:59 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 80
Aug 28 16:58:00 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 90
Aug 28 16:58:00 srvhost12 systemd[1]: Starting Proxmox VE replication runner...
Aug 28 16:58:01 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 100
Aug 28 16:58:01 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retried 100 times
Aug 28 16:58:01 srvhost12 pmxcfs[26519]: [status] crit: cpg_send_message failed: 6
Aug 28 16:58:01 srvhost12 pvesr[27714]: error during cfs-locked 'file-replication_cfg' operation: no quorum!
Aug 28 16:58:01 srvhost12 pvestatd[1371]: status update time (30.059 seconds)
Aug 28 16:58:01 srvhost12 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 28 16:58:01 srvhost12 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Aug 28 16:58:01 srvhost12 systemd[1]: Failed to start Proxmox VE replication runner.

proxmox-ve: 6.2-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-5
pve-kernel-helper: 6.2-5
pve-kernel-5.4.55-1-pve: 5.4.55-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-2
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-10
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-2
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-12
pve-xtermjs: 4.7.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1



<existingnode>
Code:
Aug 28 16:59:30 srvdata1 corosync[2515293]:   [CPG   ] *** 0x55c757f710b0 can't mcast to group pve_kvstore_v1 state:1, error:12
Aug 28 16:59:30 srvdata1 corosync[2515293]:   [MAIN  ] qb_ipcs_event_send: Transport endpoint is not connected (107)
Aug 28 16:59:36 srvdata1 corosync[2515293]:   [TOTEM ] Token has not been received in 6611 ms
Aug 28 16:59:41 srvdata1 corosync[2515293]:   [TOTEM ] Retransmit List: 56
Aug 28 16:59:41 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a05) was formed. Members
Aug 28 16:59:41 srvdata1 corosync[2515293]:   [QUORUM] Members[10]: 1 2 4 5 6 7 9 11 12 13
Aug 28 16:59:41 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 16:59:41 srvdata1 corosync[2515293]:   [MAIN  ] Q empty, queued:0 sent:479.
Aug 28 17:00:00 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a11) was formed. Members joined: 2 4 left: 2 4
Aug 28 17:00:00 srvdata1 corosync[2515293]:   [TOTEM ] Failed to receive the leave message. failed: 2 4
Aug 28 17:00:00 srvdata1 corosync[2515293]:   [TOTEM ] Retransmit List: 1
Aug 28 17:00:00 srvdata1 corosync[2515293]:   [QUORUM] Members[10]: 1 2 4 5 6 7 9 11 12 13
Aug 28 17:00:00 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 17:00:04 srvdata1 corosync[2515293]:   [KNET  ] rx: host: 10 link: 0 is up
Aug 28 17:00:04 srvdata1 corosync[2515293]:   [KNET  ] host: host: 10 (passive) best link: 0 (pri: 1)
Aug 28 17:00:06 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a15) was formed. Members joined: 3
Aug 28 17:00:06 srvdata1 corosync[2515293]:   [QUORUM] Members[11]: 1 2 3 4 5 6 7 9 11 12 13
Aug 28 17:00:06 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 17:00:10 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a19) was formed. Members joined: 8
Aug 28 17:00:11 srvdata1 corosync[2515293]:   [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 11 12 13
Aug 28 17:00:11 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 17:00:12 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a1d) was formed. Members
Aug 28 17:00:12 srvdata1 corosync[2515293]:   [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 11 12 13
Aug 28 17:00:12 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 17:00:22 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a21) was formed. Members
Aug 28 17:00:22 srvdata1 corosync[2515293]:   [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 11 12 13
Aug 28 17:00:22 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 17:00:22 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a25) was formed. Members joined: 10
Aug 28 17:00:22 srvdata1 corosync[2515293]:   [QUORUM] Members[13]: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 17:00:22 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 17:00:33 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a29) was formed. Members joined: 14
Aug 28 17:00:40 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a2d) was formed. Members left: 14
Aug 28 17:00:40 srvdata1 corosync[2515293]:   [TOTEM ] Failed to receive the leave message. failed: 14
Aug 28 17:00:40 srvdata1 corosync[2515293]:   [QUORUM] Members[13]: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 17:00:40 srvdata1 corosync[2515293]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 28 17:00:50 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a31) was formed. Members
Aug 28 17:01:00 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a35) was formed. Members
Aug 28 17:01:01 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a39) was formed. Members
Aug 28 17:01:01 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a3d) was formed. Members joined: 14
Aug 28 17:01:08 srvdata1 corosync[2515293]:   [TOTEM ] A new membership (1.1a41) was formed. Members left: 14
Aug 28 17:01:08 srvdata1 corosync[2515293]:   [TOTEM ] Failed to receive the leave message. failed: 14

proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-15-pve: 4.15.18-40
ceph-fuse: 14.2.10-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-2
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
WOOOOHOOO! Solved it!
That was so stupid!

After another short break, a look at the traffic, and a good hint in the log file, I realized the network was misconfigured!
Our cluster network cards run jumbo frames, so it was simply an MTU setting I had forgotten to set!

Shame on me! :-(

Code:
Aug 28 17:17:55 srvdata1 corosync[2515293]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 14 link 0 but the other node is not acknowledging packets of this size.
Aug 28 17:17:55 srvdata1 corosync[2515293]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.

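For anyone hitting the same thing: the quickest check is whether a full-size jumbo frame actually makes it between an existing node and the new one, and whether the new node's cluster interface has the MTU set at all. A rough sketch, assuming the cluster network is meant to run with MTU 9000 (adapt node names, interface names and sizes to your own setup):

Code:
# from an existing node: send an unfragmentable jumbo packet to the new node
# (8972 bytes payload = assumed 9000 MTU - 28 bytes IP/ICMP header)
ping -M do -s 8972 <newnode>

# compare the interface MTUs on both sides
ip -br link show

# in my case the fix was the missing mtu line for the cluster interface
# in /etc/network/interfaces on the new nodes, applied with ifupdown2:
ifreload -a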
Case closed.
 
