Faulty cluster, randomly reset all nodes, can't add a new node

Pourya Mehdinejad · May 1, 2020

Last night we tried to add a new node to the cluster, it stuck on joining showing the below messages:

can't create shared ssh key database '/etc/pve/priv/authorized_keys'
(re)generate node files
generate new node certificate
unable to create directory '/etc/pve/priv' - Permission denied

We could see the new node in the GUI but couldn't select it or do any changes.
Then SUDDENLY ALL NODES in the cluster REBOOTED.
After reboot, the cluster couldn't complete the quorum for hours. finally, we had to delnode he new node and manually restart corosync on all nodes until it became stable.
Today without any reason or change, again it happened and all nodes got rebooted.

Since they also had CEPH, it was a disaster for us, many of the VMs filesystems crashed.

We've been dealing with this issue for a long time and hoped to fix it with different solutions, but no luck so far and now we are thinking to switch away from Proxmox because it is seriously impacting our business.
We have a cluster consisting of 15 nodes, 11 of them are running ceph, 4 of them only compute.
The is the network architecture:
2 x 10GB NIC for main connectivity, live migration, and Ceph HDDs
2 x 100GB NIC for CEPH SSD connectivity
2 x 1GB NIC for corosync and an isolated switches.

All latencies are constantly below 1 ms.

The last logs in the nodes were this:

d[1]: pvesr.service: Succeeded.
May 1 02:28:01 NODEX systemd[1]: Started Proxmox VE replication runner.
May 1 02:28:22 NODEX corosync[8695]: [KNET ] rx: host: 13 link: 0 is up
May 1 02:28:22 NODEX corosync[8695]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
May 1 02:28:22 NODEX corosync[8695]: [KNET ] pmtud: PMTUD link change for host: 13 link: 0 from 469 to 1397
May 1 02:28:23 NODEX corosync[8695]: [TOTEM ] A new membership (1.242a) was formed. Members joined: 13
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: members: 1/8721, 2/1588, 3/29298, 4/315332, 5/121099, 6/123559, 7/113673, 8/6427, 9/111210, 10/131232, 11/160320, 12/150096, 13/3742, 14/122190, 15/139621
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: starting data syncronisation
May 1 02:28:23 NODEX corosync[8695]: [QUORUM] Members[15]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
May 1 02:28:23 NODEX corosync[8695]: [MAIN ] Completed service synchronization, ready to provide service.
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: cpg_send_message retried 1 times
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: members: 1/8721, 2/1588, 3/29298, 4/315332, 5/121099, 6/123559, 7/113673, 8/6427, 9/111210, 10/131232, 11/160320, 12/150096, 13/3742, 14/122190, 15/139621
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: starting data syncronisation
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: received sync request (epoch 1/8721/0000000E)
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: received sync request (epoch 1/8721/0000000E)
May 1 02:28:33 NODEX corosync[8695]: [TOTEM ] Token has not been received in 384 ms

Here is PVEVERSION -V

proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Dominic · May 14, 2020

Hi!

Have you already been able to add that node?

Has everything been stable prior to the first try to add the new node? Messages like

notice: cpg_send_message retried 1 times
May 1 02:28:33 NODEX corosync[8695]: [TOTEM ] Token has not been received in 384 ms

indicate that the cluster network might not be working ideally. Rebooting nodes can be a consequence of that.

If the problem is still there, please post

Code:

pvecm status
cat /etc/pve/corosync.conf
ha-manager status

Pourya Mehdinejad · May 14, 2020

Dominic said:
Hi!

Have you already been able to add that node?

Has everything been stable prior to the first try to add the new node? Messages like

indicate that the cluster network might not be working ideally. Rebooting nodes can be a consequence of that.

If the problem is still there, please post

Code:

pvecm status cat /etc/pve/corosync.conf ha-manager status

Thanks for your response.
The cluster was working fine before that.
but we couldn't add that node to the cluster yet, as the cluster has been acting weird ever since. We had a reboot and corosync hung the next day without us touching anything.
We 've been working with your support to find the issue.
So far we realized the following:

- After this we have removed all VMs from HA. So ha-manager status shows all nodes are idle, therefore we will not have any reboot. However corosync still hangs and if we leave it for long enough, it will cause the nodes kernel to hung and loose their connectivity.

- Corosync network is totally ok, There is no packet lost either in switch or NIC. We have testes simultaneously at the same time that corosync was complaining about token not received, the latency in the corosync network was still less than 0.200 ms.

- Apparantly it's been a corosync bug, and we had older corosync packages ( libknet 1.14) on some nodes in our cluster. We've been trying to upgrade all to the latest package (libknet 1.15) to see if it fixes the issue.
During the upgrade of each node, corosync hangs again and we have to manually stop / start corosync and pve-cluster

Dzung · Aug 25, 2020

I have same issue, one node in cluster reboot, all node will reboot I used proxmox 6.2.4

Search

Search

Faulty cluster, randomly reset all nodes, can't add a new node

Pourya Mehdinejad

Member

Dominic

Proxmox Retired Staff

Pourya Mehdinejad

Member

Dzung

Well-Known Member

We value your privacy