8.1.3 - Cluster crashes when adding new node to cluster

RickardNilsson

New Member
Nov 25, 2023
Hi,

I currently have a cluster with 5 nodes. When I try to add a sixth node to the cluster, the entire cluster becomes unstable and the nodes restart one after another.
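
For reference, the join was done roughly like this on the new node (this is just the CLI equivalent of the GUI join dialog; the addresses below are examples from my cluster network):

Code:
# on the new node, pointing at an existing cluster member (example addresses)
pvecm add 10.0.3.1 --link0 10.0.3.6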

The nodes are connected to a 10 Gb switch, with separate VLANs for cluster traffic and Ceph.

All nodes are reachable from each other, both on the subnet used for the cluster network (10.0.3.x/24) and on the subnet used for the web interface (192.168.1.x/24).
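
For what it's worth, this is roughly how I checked reachability (the addresses are examples, one per subnet):

Code:
# from each node, ping every other node on both subnets (example addresses)
ping -c 3 10.0.3.2
ping -c 3 192.168.1.2

# and check the corosync link status on one of the existing nodes
corosync-cfgtool -s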

There is no particular error logged on the nodes before they restart, but below I include some of the log entries I have found on the different machines:

Code:
Nov 25 18:26:31 rilleProx3 corosync[2008]:   [MAIN  ] *** cs_ipcs_msg_process() (2:2 - 1) No buffer space available! (there is plenty of free space available in the root partition)

Nov 25 18:27:58 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:01 rilleProx1 cron[2281]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Nov 25 18:28:02 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:03 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-firewall[2637]: status update error: Connection refused

Nov 25 18:18:28 rilleProx1 pveproxy[7850]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status' (rilleProx5 is the new machine).
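
The entries above were collected from the journal on the individual nodes, roughly like this:

Code:
# run on each node around the time of the restarts
journalctl -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm --since "2023-11-25 18:00"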

As soon as I shut down the new machine, everything goes back to green. It's also a bit difficult to debug, since the entire cluster becomes unstable as soon as I switch on the new node.

When adding the new node I got an error message (see the attached screenshot, Screenshot 2023-11-25 at 18.48.05.png).

And the /etc/pve folder looked like this on the new node (see the attached Untitled.jpg):

And one of the working nodes now complains about:
Code:
Nov 25 18:57:47 rilleProx2 pve-ha-crm[3141]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status'

And ls -l in that folder gives:
Code:
root@rilleProx2:~# ls -l /etc/pve/nodes/rilleProx5/
total 0
-rw-r----- 1 root www-data 0 Nov 25 18:17 lrm_status.tmp.2324
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 lxc
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 openvz
drwx------ 2 root www-data 0 Nov 25 17:00 priv
-rw-r----- 1 root www-data 0 Nov 25 18:13 pve-ssl.key
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 qemu-server

lrm_status.tmp.2324 is empty.
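
Unless someone tells me otherwise, my plan is to remove the half-joined node again, roughly like this, but I'm not sure it is safe while the cluster is in this state:

Code:
# on one of the existing nodes, with rilleProx5 powered off (not yet run)
pvecm delnode rilleProx5
rm -r /etc/pve/nodes/rilleProx5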

Has anyone faced a similar issue and knows how to resolve it? Adding new nodes to clusters has always worked perfectly for me before.

Best regards,
Rickard
Edit: Now when I check the cluster status from another node, I get this error:

Code:
'/etc/pve/nodes/rilleProx5/pve-ssl.pem' does not exist! (500)
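
I'm also wondering whether regenerating the node certificates would help once the cluster is stable again, e.g. something like the following, but I have not tried it yet:

Code:
# not yet run - would this be safe in the current state?
pvecm updatecerts --force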