Hello,
We have been using PVE since version 2, and after years of smooth operation I am now running into a serious problem.
We run a 6.2 cluster with 13 nodes and want to add another 2 nodes.
After issuing pvecm add <existingnode> on the first new node, the whole cluster lost quorum; corosync no longer formed new memberships and lost every former member.
I had to interrupt the join operation and stop and restart the corosync service on all nodes. I then ran pvecm del <firstnewnode> and tried again with the second new node (I had to reinstall the first one because of some problems with pmxcfs).
No luck so far.
I don't know what is happening here. Could it be a version problem? The new nodes have newer kernels and a newer pve-manager than the existing cluster nodes (no updates available).
I have never seen such strange behaviour when adding new nodes. The network configuration and the hosts files follow the same scheme on every host. Nothing special.
This has been giving me a headache for the last few hours. Any suggestions? Which information can I provide?
Here is some information:
<newnode>
Code:
Aug 28 16:57:58 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 70
Aug 28 16:57:58 srvhost12 systemd[1]: Started Session 32 of user root.
Aug 28 16:57:59 srvhost12 corosync[26513]: [TOTEM ] A new membership (e.122d) was formed. Members left: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 16:57:59 srvhost12 corosync[26513]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 16:57:59 srvhost12 corosync[26513]: [QUORUM] Members[1]: 14
Aug 28 16:57:59 srvhost12 corosync[26513]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 16:57:59 srvhost12 corosync[26513]: [TOTEM ] A new membership (1.1231) was formed. Members joined: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 16:57:59 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 80
Aug 28 16:58:00 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 90
Aug 28 16:58:00 srvhost12 systemd[1]: Starting Proxmox VE replication runner...
Aug 28 16:58:01 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retry 100
Aug 28 16:58:01 srvhost12 pmxcfs[26519]: [status] notice: cpg_send_message retried 100 times
Aug 28 16:58:01 srvhost12 pmxcfs[26519]: [status] crit: cpg_send_message failed: 6
Aug 28 16:58:01 srvhost12 pvesr[27714]: error during cfs-locked 'file-replication_cfg' operation: no quorum!
Aug 28 16:58:01 srvhost12 pvestatd[1371]: status update time (30.059 seconds)
Aug 28 16:58:01 srvhost12 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 28 16:58:01 srvhost12 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Aug 28 16:58:01 srvhost12 systemd[1]: Failed to start Proxmox VE replication runner.
proxmox-ve: 6.2-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-5
pve-kernel-helper: 6.2-5
pve-kernel-5.4.55-1-pve: 5.4.55-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-2
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-10
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-2
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-12
pve-xtermjs: 4.7.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
<existingnode>
Code:
Aug 28 16:59:30 srvdata1 corosync[2515293]: [CPG ] *** 0x55c757f710b0 can't mcast to group pve_kvstore_v1 state:1, error:12
Aug 28 16:59:30 srvdata1 corosync[2515293]: [MAIN ] qb_ipcs_event_send: Transport endpoint is not connected (107)
Aug 28 16:59:36 srvdata1 corosync[2515293]: [TOTEM ] Token has not been received in 6611 ms
Aug 28 16:59:41 srvdata1 corosync[2515293]: [TOTEM ] Retransmit List: 56
Aug 28 16:59:41 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a05) was formed. Members
Aug 28 16:59:41 srvdata1 corosync[2515293]: [QUORUM] Members[10]: 1 2 4 5 6 7 9 11 12 13
Aug 28 16:59:41 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 16:59:41 srvdata1 corosync[2515293]: [MAIN ] Q empty, queued:0 sent:479.
Aug 28 17:00:00 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a11) was formed. Members joined: 2 4 left: 2 4
Aug 28 17:00:00 srvdata1 corosync[2515293]: [TOTEM ] Failed to receive the leave message. failed: 2 4
Aug 28 17:00:00 srvdata1 corosync[2515293]: [TOTEM ] Retransmit List: 1
Aug 28 17:00:00 srvdata1 corosync[2515293]: [QUORUM] Members[10]: 1 2 4 5 6 7 9 11 12 13
Aug 28 17:00:00 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 17:00:04 srvdata1 corosync[2515293]: [KNET ] rx: host: 10 link: 0 is up
Aug 28 17:00:04 srvdata1 corosync[2515293]: [KNET ] host: host: 10 (passive) best link: 0 (pri: 1)
Aug 28 17:00:06 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a15) was formed. Members joined: 3
Aug 28 17:00:06 srvdata1 corosync[2515293]: [QUORUM] Members[11]: 1 2 3 4 5 6 7 9 11 12 13
Aug 28 17:00:06 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 17:00:10 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a19) was formed. Members joined: 8
Aug 28 17:00:11 srvdata1 corosync[2515293]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 11 12 13
Aug 28 17:00:11 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 17:00:12 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a1d) was formed. Members
Aug 28 17:00:12 srvdata1 corosync[2515293]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 11 12 13
Aug 28 17:00:12 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 17:00:22 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a21) was formed. Members
Aug 28 17:00:22 srvdata1 corosync[2515293]: [QUORUM] Members[12]: 1 2 3 4 5 6 7 8 9 11 12 13
Aug 28 17:00:22 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 17:00:22 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a25) was formed. Members joined: 10
Aug 28 17:00:22 srvdata1 corosync[2515293]: [QUORUM] Members[13]: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 17:00:22 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 17:00:33 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a29) was formed. Members joined: 14
Aug 28 17:00:40 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a2d) was formed. Members left: 14
Aug 28 17:00:40 srvdata1 corosync[2515293]: [TOTEM ] Failed to receive the leave message. failed: 14
Aug 28 17:00:40 srvdata1 corosync[2515293]: [QUORUM] Members[13]: 1 2 3 4 5 6 7 8 9 10 11 12 13
Aug 28 17:00:40 srvdata1 corosync[2515293]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 28 17:00:50 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a31) was formed. Members
Aug 28 17:01:00 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a35) was formed. Members
Aug 28 17:01:01 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a39) was formed. Members
Aug 28 17:01:01 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a3d) was formed. Members joined: 14
Aug 28 17:01:08 srvdata1 corosync[2515293]: [TOTEM ] A new membership (1.1a41) was formed. Members left: 14
Aug 28 17:01:08 srvdata1 corosync[2515293]: [TOTEM ] Failed to receive the leave message. failed: 14
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-15-pve: 4.15.18-40
ceph-fuse: 14.2.10-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve2
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-2
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
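
To check the version question, the two package lists can be compared mechanically instead of by eye. This is only a minimal sketch: it assumes the `pveversion -v` output of each node is available as plain text, and the function names `parse_versions` and `diff_versions` are made up for this illustration. Only a few lines from the lists above are included here to keep it short.

```python
def parse_versions(text):
    """Map package name -> version string from `pveversion -v` style lines."""
    versions = {}
    for line in text.strip().splitlines():
        # Split on the first colon only; the version part may contain colons.
        name, sep, version = line.partition(":")
        if sep:
            versions[name.strip()] = version.strip()
    return versions


def diff_versions(new_text, old_text):
    """Return {package: (version_on_new_node, version_on_existing_node)}
    for every package whose version string differs or is missing on one side."""
    new, old = parse_versions(new_text), parse_versions(old_text)
    diffs = {}
    for pkg in sorted(set(new) | set(old)):
        if new.get(pkg) != old.get(pkg):
            diffs[pkg] = (new.get(pkg), old.get(pkg))
    return diffs


# Excerpts copied from the two lists in this post (hypothetical variable names).
NEW_NODE = """\
proxmox-ve: 6.2-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
corosync: 3.0.4-pve1
libknet1: 1.16-pve1
pve-cluster: 6.1-8
"""

EXISTING_NODE = """\
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
corosync: 3.0.4-pve1
libknet1: 1.16-pve1
pve-cluster: 6.1-8
"""

if __name__ == "__main__":
    for pkg, (new, old) in diff_versions(NEW_NODE, EXISTING_NODE).items():
        print(f"{pkg}: new node = {new} | existing node = {old}")
```

In the lists posted here, corosync, libknet1 and pve-cluster are identical on both nodes; the differences are in the kernel, pve-manager and a few other packages.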