Last night we tried to add a new node to the cluster. It got stuck during the join, showing the messages below:
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
(re)generate node files
generate new node certificate
unable to create directory '/etc/pve/priv' - Permission denied
We could see the new node in the GUI but couldn't select it or make any changes to it.
Then SUDDENLY ALL NODES in the cluster REBOOTED.
After the reboot, the cluster couldn't regain quorum for hours. In the end we had to delnode the new node and manually restart corosync on every node until the cluster became stable.
Today, without any change on our side, it happened again and all nodes rebooted.
Since they also run Ceph, it was a disaster for us: many of the VMs' filesystems were corrupted.
We've been dealing with this issue for a long time and have tried different fixes, but no luck so far. We are now considering moving away from Proxmox because this is seriously impacting our business.
We have a cluster of 15 nodes; 11 of them run Ceph, the other 4 are compute-only.
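For context on why the cluster stayed down so long: with 15 votes, corosync's votequorum needs a strict majority before pmxcfs becomes writable again. A quick sketch of the arithmetic (generic majority rule, not Proxmox-specific code):

```python
# Majority quorum for a corosync/votequorum cluster:
# quorum = floor(total_votes / 2) + 1
def quorum(total_votes: int) -> int:
    return total_votes // 2 + 1

print(quorum(15))  # 8 -> at least 8 of our 15 nodes must see each other
```

So after the mass reboot, nothing recovers until at least 8 nodes form a stable membership, which is why restarting corosync node by node eventually brought it back.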
This is the network architecture:
2 x 10 Gbit NICs for main connectivity, live migration, and the Ceph HDD pool
2 x 100 Gbit NICs for the Ceph SSD pool
2 x 1 Gbit NICs for corosync, on isolated switches
All latencies are consistently below 1 ms.
These were the last log entries on the nodes:
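Since we have two dedicated corosync NICs on separate switches, both are defined as separate knet links so corosync can fail over between them. A hypothetical corosync.conf fragment showing the shape of that setup (node name and IPs are made-up examples, not our real config):

```
# Example only: two knet links on the two isolated corosync switches
totem {
  version: 2
  transport: knet
}
nodelist {
  node {
    name: nodex
    nodeid: 1
    ring0_addr: 10.10.0.1   # first corosync switch (example IP)
    ring1_addr: 10.10.1.1   # second corosync switch (example IP)
  }
}
```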
d[1]: pvesr.service: Succeeded.
May 1 02:28:01 NODEX systemd[1]: Started Proxmox VE replication runner.
May 1 02:28:22 NODEX corosync[8695]: [KNET ] rx: host: 13 link: 0 is up
May 1 02:28:22 NODEX corosync[8695]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
May 1 02:28:22 NODEX corosync[8695]: [KNET ] pmtud: PMTUD link change for host: 13 link: 0 from 469 to 1397
May 1 02:28:23 NODEX corosync[8695]: [TOTEM ] A new membership (1.242a) was formed. Members joined: 13
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: members: 1/8721, 2/1588, 3/29298, 4/315332, 5/121099, 6/123559, 7/113673, 8/6427, 9/111210, 10/131232, 11/160320, 12/150096, 13/3742, 14/122190, 15/139621
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: starting data syncronisation
May 1 02:28:23 NODEX corosync[8695]: [QUORUM] Members[15]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
May 1 02:28:23 NODEX corosync[8695]: [MAIN ] Completed service synchronization, ready to provide service.
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: cpg_send_message retried 1 times
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: members: 1/8721, 2/1588, 3/29298, 4/315332, 5/121099, 6/123559, 7/113673, 8/6427, 9/111210, 10/131232, 11/160320, 12/150096, 13/3742, 14/122190, 15/139621
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: starting data syncronisation
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: received sync request (epoch 1/8721/0000000E)
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: received sync request (epoch 1/8721/0000000E)
May 1 02:28:33 NODEX corosync[8695]: [TOTEM ] Token has not been received in 384 ms
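That last "Token has not been received" line is what we have been grepping for across all nodes when chasing these fencing events. A small sketch of the scan we run over the syslog files (the regex assumes the syslog format shown above; file handling omitted):

```python
import re
from collections import Counter

# Count corosync token-loss warnings per host from syslog-style lines, e.g.
# "May 1 02:28:33 NODEX corosync[8695]: [TOTEM ] Token has not been received in 384 ms"
TOKEN_RE = re.compile(
    r"^\w+\s+\d+ [\d:]+ (\S+) corosync\[\d+\]: .*Token has not been received"
)

def token_losses(lines):
    hits = Counter()
    for line in lines:
        m = TOKEN_RE.match(line)
        if m:
            hits[m.group(1)] += 1  # key by hostname
    return hits

sample = [
    "May 1 02:28:23 NODEX corosync[8695]: [MAIN  ] Completed service synchronization, ready to provide service.",
    "May 1 02:28:33 NODEX corosync[8695]: [TOTEM ] Token has not been received in 384 ms",
]
print(token_losses(sample))  # Counter({'NODEX': 1})
```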
Here is the output of pveversion -v:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1