8.1.3 - Cluster crashes when adding new node to cluster

RickardNilsson

New Member
Nov 25, 2023

Hi,

I currently have a cluster with five nodes. When I try to add a sixth node, the entire cluster becomes unstable and the nodes restart one after another.

The nodes are connected to a 10 Gb switch, with separate VLANs for cluster traffic and Ceph.

All nodes are reachable from each other, both on the subnet used for the cluster network (10.0.3.x/24) and on the subnet used for the web interface (192.168.1.x/24).
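
For reference, reachability was checked with plain pings on both subnets, roughly like this (the target IPs are just examples from the ranges above; I repeated this from each node against every other node):

Code:
# run from each node against every other node, on both VLANs
ping -c 3 10.0.3.4       # cluster network
ping -c 3 192.168.1.4    # web interface network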

There is no particular error logged on the nodes before they restart, but below are some of the log entries I found on the different machines:

Code:
Nov 25 18:26:31 rilleProx3 corosync[2008]:   [MAIN  ] *** cs_ipcs_msg_process() (2:2 - 1) No buffer space available! (there is plenty of free space in the root partition)

Nov 25 18:27:58 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:01 rilleProx1 cron[2281]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Nov 25 18:28:02 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:03 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-firewall[2637]: status update error: Connection refused

Nov 25 18:18:28 rilleProx1 pveproxy[7850]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status' (rilleProx5 is the new machine).

As soon as I shut down the new machine, everything goes back to green. It's also a bit difficult to debug, since the entire cluster becomes unstable as soon as I switch the new node on.
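
To capture some state while the problem is happening, I run roughly the following on one of the old nodes right after powering the new node on (just a minimal sketch of what I look at; the time window is arbitrary):

Code:
pvecm status                                                 # quorum and membership
journalctl -u corosync -u pve-cluster --since "10 minutes ago"   # recent cluster logs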

When adding the new node I got an error message (screenshot attached: Screenshot 2023-11-25 at 18.48.05.png).

And the /etc/pve folder on the new node looked like the second attached screenshot (Untitled.jpg).

And one of the working nodes now complains:
Code:
Nov 25 18:57:47 rilleProx2 pve-ha-crm[3141]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status'

And an ls of that folder gives:
Code:
root@rilleProx2:~# ls -l /etc/pve/nodes/rilleProx5/
total 0
-rw-r----- 1 root www-data 0 Nov 25 18:17 lrm_status.tmp.2324
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 lxc
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 openvz
drwx------ 2 root www-data 0 Nov 25 17:00 priv
-rw-r----- 1 root www-data 0 Nov 25 18:13 pve-ssl.key
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 qemu-server

lrm_status.tmp.2324 is empty.
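
Given the ipcc_send_rec "Connection refused" messages above, I assume pmxcfs (the pve-cluster service) is the thing to watch, so on each node I also check something like this (just my own guess at where to look):

Code:
systemctl status pve-cluster corosync                # is pmxcfs / corosync still running?
journalctl -u pve-cluster --since "10 minutes ago"   # pmxcfs log around the failure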

Has anyone faced a similar issue and knows how to resolve it? Adding new nodes to clusters has always worked perfectly well for me before.

Best regards,
Rickard
 
Edit:
And now when checking the cluster status from another node, I get this error:

Code:
'/etc/pve/nodes/rilleProx5/pve-ssl.pem' does not exist! (500)
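
If it matters: as far as I understand, the node certificates can normally be regenerated with the command below, but I have not tried it yet while the cluster is in this state, so treat it as an untested idea rather than something I have verified:

Code:
# run on the affected node; should regenerate pve-ssl.pem under /etc/pve/nodes/<node>/
pvecm updatecerts --force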
 
