8.1.3 - Cluster crashes when adding new node to cluster

RickardNilsson

New Member
Nov 25, 2023
Hi,

I currently have a cluster with 5 nodes. When I try to add a sixth node to the cluster, the entire cluster becomes unstable and the nodes restart one after another.
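
For reference, the join was done roughly like this on the new node (this is just the CLI equivalent of the GUI join dialog; the addresses below are examples from my cluster network):

Code:
# on the new node, pointing at an existing cluster member (example addresses)
pvecm add 10.0.3.1 --link0 10.0.3.6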

The nodes are connected to a 10 Gb switch, with separate VLANs for cluster traffic and Ceph.

All nodes are reachable from each other, both on the subnet used for the cluster network (10.0.3.x/24) and on the subnet used for the web interface (192.168.1.x/24).
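
For what it's worth, this is roughly how I checked reachability (the addresses are examples, one per subnet):

Code:
# from each node, ping every other node on both subnets (example addresses)
ping -c 3 10.0.3.2
ping -c 3 192.168.1.2

# and check the corosync link status on one of the existing nodes
corosync-cfgtool -s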

There is no particular error logged on the nodes before they restart, but below I include some of the log entries I have found on the different machines:

Code:
Nov 25 18:26:31 rilleProx3 corosync[2008]:   [MAIN  ] *** cs_ipcs_msg_process() (2:2 - 1) No buffer space available! (there is plenty of free space available in the root partition)

Nov 25 18:27:58 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:01 rilleProx1 cron[2281]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Nov 25 18:28:02 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:03 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-firewall[2637]: status update error: Connection refused

Nov 25 18:18:28 rilleProx1 pveproxy[7850]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status' (rilleProx5 is the new machine).
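
The entries above were collected from the journal on the individual nodes, roughly like this:

Code:
# run on each node around the time of the restarts
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm --since "2023-11-25 18:00"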

As soon as I shut down the new machine, everything goes back to green. It's also a bit difficult to debug, since the entire cluster becomes unstable as soon as I switch on the new node.

When adding the new node I got an error message (see the attached screenshot, Screenshot 2023-11-25 at 18.48.05.png).

And the /etc/pve folder looked like this on the new node (see the attached Untitled.jpg):

And one of the working nodes now complains about:
Code:
Nov 25 18:57:47 rilleProx2 pve-ha-crm[3141]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status'

And ls -l in that folder gives:
Code:
root@rilleProx2:~# ls -l /etc/pve/nodes/rilleProx5/
total 0
-rw-r----- 1 root www-data 0 Nov 25 18:17 lrm_status.tmp.2324
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 lxc
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 openvz
drwx------ 2 root www-data 0 Nov 25 17:00 priv
-rw-r----- 1 root www-data 0 Nov 25 18:13 pve-ssl.key
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 qemu-server

lrm_status.tmp.2324 is empty.
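
Unless someone tells me otherwise, my plan is to remove the half-joined node again, roughly like this, but I'm not sure it is safe while the cluster is in this state:

Code:
# on one of the existing nodes, with rilleProx5 powered off (not yet run)
pvecm delnode rilleProx5
rm -r /etc/pve/nodes/rilleProx5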

Has anyone faced a similar issue and knows how to resolve it? Adding new nodes to clusters has always worked perfectly for me before.

Best regards,
Rickard
Edit: Now when I check the cluster status from another node, I get this error:

Code:
'/etc/pve/nodes/rilleProx5/pve-ssl.pem' does not exist! (500)
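
I'm also wondering whether regenerating the node certificates would help once the cluster is stable again, e.g. something like the following, but I have not tried it yet:

Code:
# not yet run - would this be safe in the current state?
pvecm updatecerts --force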