Hi,
I currently have a cluster with 5 nodes. When I try to add a sixth node, the entire cluster becomes unstable and the nodes restart one after another.
The nodes are connected to a 10 Gb switch, with separate VLANs for cluster traffic and Ceph.
All nodes are reachable from each other, both on the subnet used for the cluster network (10.0.3.x/24) and on the subnet used for the web interface (192.168.1.x/24).
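For reference, this is roughly how I verified reachability (the host addresses below are just examples of the kind of pings I ran, not the exact assignment):
Code:
# example only: ping every node once on both the cluster network and the management network
for i in 1 2 3 4 5 6; do
    ping -c 1 10.0.3.$i
    ping -c 1 192.168.1.$i
done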
There is no particular error logged on the nodes before they restart, but below are some of the log entries I found on the different machines:
Code:
Nov 25 18:26:31 rilleProx3 corosync[2008]: [MAIN ] *** cs_ipcs_msg_process() (2:2 - 1) No buffer space available! (there is plenty of free space available in the root partition)
Nov 25 18:27:58 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:01 rilleProx1 cron[2281]: (*system*vzdump) CAN'T OPEN SYMLINK (/etc/cron.d/vzdump)
Nov 25 18:28:02 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:03 rilleProx1 pve-ha-lrm[7857]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-ha-crm[3362]: cluster file system update failed - ipcc_send_rec[1] failed: Connection refused
Nov 25 18:28:07 rilleProx1 pve-firewall[2637]: status update error: Connection refused
Nov 25 18:18:28 rilleProx1 pveproxy[7850]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status' (rilleProx5 is the new machine).
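If it helps, I can also post output from the usual cluster status commands on the existing nodes and the new one, for example:
Code:
# quorum / membership as seen by each node
pvecm status
pvecm nodes
# service state and recent logs for corosync and the cluster filesystem
systemctl status corosync pve-cluster
journalctl -u corosync -u pve-cluster --since today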
As soon as I shut down the new machine, everything goes back to green. It's also a bit difficult to debug, since the entire cluster becomes unstable when I switch on the new node.
When adding the new node I got an error message (screenshot attached).
And the pve folder looked like this on the new node:
And one of the working nodes now complains about:
Code:
Nov 25 18:57:47 rilleProx2 pve-ha-crm[3141]: unable to read file '/etc/pve/nodes/rilleProx5/lrm_status'
And ls -l in that folder gives:
Code:
root@rilleProx2:~# ls -l /etc/pve/nodes/rilleProx5/
total 0
-rw-r----- 1 root www-data 0 Nov 25 18:17 lrm_status.tmp.2324
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 lxc
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 openvz
drwx------ 2 root www-data 0 Nov 25 17:00 priv
-rw-r----- 1 root www-data 0 Nov 25 18:13 pve-ssl.key
drwxr-xr-x 2 root www-data 0 Nov 25 17:00 qemu-server
lrm_status.tmp.2324 is empty.
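For completeness, the next thing I plan to check on the new node is whether the cluster filesystem is actually mounted and quorate there (standard commands, output not included yet):
Code:
# /etc/pve should be a pmxcfs fuse mount, not a plain local directory
findmnt /etc/pve
# quorum as seen from the new node
pvecm status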
Has anyone faced a similar issue and knows how to resolve it? Adding new nodes to other clusters has always worked perfectly well for me before.
Best regards,
Rickard