Faulty cluster randomly resets all nodes, can't add a new node

Last night we tried to add a new node to the cluster. It got stuck on joining, showing the messages below:

can't create shared ssh key database '/etc/pve/priv/authorized_keys'
(re)generate node files
generate new node certificate
unable to create directory '/etc/pve/priv' - Permission denied

We could see the new node in the GUI but couldn't select it or make any changes.
Then SUDDENLY ALL NODES in the cluster REBOOTED.
After the reboot, the cluster couldn't reach quorum for hours. Finally, we had to delnode the new node and manually restart corosync on all nodes until it became stable.
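For reference, the manual recovery on each node was roughly the following (a rough sketch; the new node's name is a placeholder):
Code:
# restart the corosync stack and the cluster filesystem on each node
systemctl stop pve-cluster corosync
systemctl start corosync
systemctl start pve-cluster
# then, from a node that has quorum, remove the half-joined node
pvecm delnode NEWNODE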
Today, without any change on our side, it happened again and all nodes rebooted.



Since these nodes also run Ceph, it was a disaster for us; many of the VMs' filesystems got corrupted.


We've been dealing with this issue for a long time and have tried different solutions, but with no luck so far; we are now considering switching away from Proxmox because it is seriously impacting our business.
We have a cluster of 15 nodes; 11 of them run Ceph and 4 are compute-only.
This is the network architecture:
2 x 10 Gbit NICs for main connectivity, live migration, and the Ceph HDD pool
2 x 100 Gbit NICs for the Ceph SSD pool
2 x 1 Gbit NICs for corosync, on isolated switches

All latencies are constantly below 1 ms.
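For anyone who wants to reproduce that check, a plain ping over the corosync ring IPs gives a rough number (the address below is just a placeholder):
Code:
# 1000 probes, 200 ms apart, summary only
ping -c 1000 -i 0.2 -q 10.10.10.12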


These were the last logs on the nodes:

d[1]: pvesr.service: Succeeded.
May 1 02:28:01 NODEX systemd[1]: Started Proxmox VE replication runner.
May 1 02:28:22 NODEX corosync[8695]: [KNET ] rx: host: 13 link: 0 is up
May 1 02:28:22 NODEX corosync[8695]: [KNET ] host: host: 13 (passive) best link: 0 (pri: 1)
May 1 02:28:22 NODEX corosync[8695]: [KNET ] pmtud: PMTUD link change for host: 13 link: 0 from 469 to 1397
May 1 02:28:23 NODEX corosync[8695]: [TOTEM ] A new membership (1.242a) was formed. Members joined: 13
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX corosync[8695]: [CPG ] downlist left_list: 0 received
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: members: 1/8721, 2/1588, 3/29298, 4/315332, 5/121099, 6/123559, 7/113673, 8/6427, 9/111210, 10/131232, 11/160320, 12/150096, 13/3742, 14/122190, 15/139621
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: starting data syncronisation
May 1 02:28:23 NODEX corosync[8695]: [QUORUM] Members[15]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
May 1 02:28:23 NODEX corosync[8695]: [MAIN ] Completed service synchronization, ready to provide service.
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: cpg_send_message retried 1 times
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: members: 1/8721, 2/1588, 3/29298, 4/315332, 5/121099, 6/123559, 7/113673, 8/6427, 9/111210, 10/131232, 11/160320, 12/150096, 13/3742, 14/122190, 15/139621
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: starting data syncronisation
May 1 02:28:23 NODEX pmxcfs[8721]: [dcdb] notice: received sync request (epoch 1/8721/0000000E)
May 1 02:28:23 NODEX pmxcfs[8721]: [status] notice: received sync request (epoch 1/8721/0000000E)
May 1 02:28:33 NODEX corosync[8695]: [TOTEM ] Token has not been received in 384 ms



Here is the output of pveversion -v:


proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph: 14.2.8-pve1
ceph-fuse: 14.2.8-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
Hi!

Have you already been able to add that node?

Has everything been stable prior to the first attempt to add the new node? Messages like
notice: cpg_send_message retried 1 times
May 1 02:28:33 NODEX corosync[8695]: [TOTEM ] Token has not been received in 384 ms
indicate that the cluster network might not be working ideally. Rebooting nodes can be a consequence of that.

If the problem is still there, please post
Code:
pvecm status
cat /etc/pve/corosync.conf
ha-manager status
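If you want to check the knet links themselves in the meantime, something along these lines on each node can help (just a suggestion):
Code:
corosync-cfgtool -s
journalctl -u corosync --since "1 hour ago" | grep -Ei 'knet|token|retransmit'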
 
Thanks for your response.
The cluster was working fine before that.
But we haven't been able to add that node to the cluster yet, as the cluster has been acting weird ever since: we had a reboot, and corosync hung the next day without us touching anything.
We've been working with your support to find the issue.
So far we have realized the following:

- After this we removed all VMs from HA, so ha-manager status shows all nodes as idle and we should not get any more reboots. However, corosync still hangs, and if we leave it long enough it causes the nodes' kernels to hang and lose their connectivity.

- The corosync network is totally fine; there is no packet loss on either the switches or the NICs. We tested while corosync was complaining about the token not being received, and the latency on the corosync network was still below 0.200 ms.

- Apparently it is a corosync bug, and we had older corosync packages (libknet 1.14) on some nodes in our cluster. We've been trying to upgrade all of them to the latest package (libknet 1.15) to see if it fixes the issue.
During the upgrade of each node, corosync hangs again and we have to manually stop/start corosync and pve-cluster (roughly the sequence sketched below).
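The sequence we end up running on each node is roughly this (a rough sketch; it assumes the standard PVE package repositories are configured):
Code:
# pull in the newer knet/corosync packages
apt update
apt install --only-upgrade libknet1 corosync
# when corosync hangs during/after the upgrade, bounce the stack manually
systemctl stop pve-cluster corosync
systemctl start corosync
systemctl start pve-cluster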
 
I have the same issue: if one node in the cluster reboots, all nodes reboot. I am using Proxmox 6.2.4.
 
