Hi
This will be quite a long text, as I'm trying to put all the info together, so please bear with me.
tldr: Joining a node to an existing cluster fails with several "permission denied" errors.
So a bit of context:
I'm currently in the process of moving my homelab from VMware to PVE.
The complete setup consists of:
2 x HP 360p G8
1 x HP 380 G6
2 x HP Prodesk (I know, not a server, but they're quite useful for me)
Each computer has 2 NICs:
The first one is used for accessing the management GUI and also for the cluster network.
It's a separate VLAN; apart from the management interfaces, it only contains the gateway and the iLO interfaces. This should be fine.
The second one is for accessing an NFS storage and also for the regular VM traffic. It's not optimal to mix both, but I don't have more NICs available and it's a homelab after all, so no heavy traffic is to be expected. Also, this should not impact the cluster network.
Each management NIC is on 1 Gbit copper.
The data interfaces are a bit mixed: the G8s have 10 Gbit SR transceivers, the G6 has a 3x1G LACP trunk, and the Prodesks have a regular 1 Gbit NIC.
Everything is connected with an Aruba JL357A 2540-48G-PoE+-4SFP+ Switch.
Hostnames: vhx01-vhx05
10.151.27.111-115: Management interfaces (and corosync network). All IPs are in DNS and resolve in both directions.
172.16.22.11-14: Storage Network.
10.150.0.1-5: VLAN on top of vmbr0 for Migration
vmbr0 sits on the corresponding NIC used for storage/vm traffic
Installation was quite simple. I used the current .iso and a USB key. No host has been updated so far, to avoid issues with the cluster setup, so everything should be on the same version. All nodes are configured identically, except for slightly different network configurations because of the different hardware.
Each node can ping every other node on all interfaces.
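For anyone who wants to repeat the reachability check, here is a minimal sketch of the kind of loop that can be run on each node. It assumes node N maps to 10.151.27.11N and 10.150.0.N (consistent with the ranges above, but the exact per-node mapping is an assumption), leaves the storage range out, and uses `PING_CMD` as a purely hypothetical hook for dry runs. Note that this only tests ICMP; it says nothing about the UDP traffic corosync uses.

```shell
#!/usr/bin/env bash
# Address plan from the post; the per-node numbering is an assumption.
mgmt_ip()      { printf '10.151.27.%d\n' "$((110 + $1))"; }
migration_ip() { printf '10.150.0.%d\n' "$1"; }

# Ping the management and migration address of vhx01..vhx05 once each.
# PING_CMD is a hypothetical override (e.g. PING_CMD=true for a dry run).
check_nodes() {
    local n ip failed=0
    for n in 1 2 3 4 5; do
        for ip in "$(mgmt_ip "$n")" "$(migration_ip "$n")"; do
            # Intentionally unquoted so the default command is word-split.
            if ${PING_CMD:-ping -c 1 -W 1} "$ip" >/dev/null 2>&1; then
                printf 'vhx0%d %s ok\n' "$n" "$ip"
            else
                printf 'vhx0%d %s FAILED\n' "$n" "$ip"
                failed=1
            fi
        done
    done
    return "$failed"
}
```

Calling `check_nodes` prints one line per address and returns non-zero if any address did not answer.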
After installing vhx01 I created the cluster and installed vhx02-vhx05.
Each node was joined to the cluster via the GUI; vhx02, vhx03 and vhx05 joined without any issues.
Unfortunately, joining vhx04 resulted in the GUI on all nodes becoming unresponsive.
Removing vhx04 from the cluster brought back the GUI.
What I tried so far (apart from searching the web):
Double- and triple-checked the VLANs on the switch. They are fine.
Double-checked time synchronization: all nodes are synchronized and have no perceivable difference in time.
I wiped every trace of vhx04 from the other nodes, reinstalled vhx04 and tried again with the same result.
After removing vhx04 again and wiping everything, I reinstalled with a different IP (10.151.27.116) and hostname (vhx06) with the same result.
Joining via a different node (vhx02 instead of vhx01) gave the same result.
Using the CLI to join gives me this:
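Regarding the time synchronization check in the list above: a rough way to put a number on it is to collect `date +%s` from every node and look at the spread. The sketch below is only an illustration; `max_skew` is a hypothetical helper name, and the hostnames and root SSH access in the usage line are assumptions.

```shell
# max_skew: read one epoch timestamp per line on stdin and
# print the difference between the largest and smallest, in seconds.
max_skew() {
    awk 'NR == 1  { min = max = $1 }
         $1 < min { min = $1 }
         $1 > max { max = $1 }
         END      { if (NR) print max - min; else print 0 }'
}
```

Usage would be something like `for h in vhx01 vhx02 vhx03 vhx04 vhx05; do ssh "root@$h" date +%s; done | max_skew`. Because the nodes are queried one after another, the SSH round trips themselves add a little to the reported value, so this only gives an upper bound on the real skew.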
Code:
root@vhx04:~# pvecm add 10.151.27.112
Please enter superuser (root) password for '10.151.27.112': *************
Establishing API connection with host '10.151.27.112'
The authenticity of host '10.151.27.112' can't be established.
X509 SHA256 key fingerprint is ED:E7:(removed the rest)
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '10.151.27.114'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1731088444.sql.gz'
waiting for quorum...OK
Now it hangs for about 5 minutes.
In the meantime, the GUI of the nodes becomes unresponsive.
Accessing the cluster's GUI throws errors (connection refused) when selecting the node.
When accessing the "cluster" part of the GUI, I'm getting errors about a missing /etc/pve/pve-ssl.pem.
After about 5 minutes I'm getting this:
Code:
can't create shared ssh key database '/etc/pve/priv/authorized_keys'
(re)generate node files
generate new node certificate
unable to create directory '/etc/pve/nodes' - Permission denied
Accessing any file within /etc/pve makes the system hang completely, on any node.
Node vhx is accessible via SSH only.
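The join log above shows pvecm falling back to the local node IP because no link was passed explicitly. For reference, a join that names the link explicitly would look like the command printed by this sketch. `join_cmd` is a hypothetical helper that only prints the command for review and does not run it. With the addresses from the log it selects the same link as the fallback, so it mainly rules out address auto-detection as a factor.

```shell
# Print (do not run) a join command with an explicit corosync link.
# $1 = address of an existing cluster member
# $2 = this node's own address on the corosync network
join_cmd() {
    printf 'pvecm add %s --link0 %s\n' "$1" "$2"
}

join_cmd 10.151.27.112 10.151.27.114
```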
As all other nodes (including the other Prodesk) were able to join, I have no idea why this one fails so catastrophically.
I'm open to suggestions on where to look. If any further info is required, I'm happy to provide it.
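As a starting point, this is a sketch of how the cluster state could be collected on a node. `collect_diag` and `DIAG_TIMEOUT` are hypothetical names; the commands themselves are the standard pvecm, corosync, systemd and journal tools. Each command is wrapped in `timeout` because of the hangs described above, although a process stuck in uninterruptible I/O may still not return.

```shell
# Run a fixed set of read-only diagnostic commands, each with a time limit.
# DIAG_TIMEOUT (seconds, default 10) is a hypothetical knob for this sketch.
collect_diag() {
    local t="${DIAG_TIMEOUT:-10}" cmd
    while IFS= read -r cmd; do
        printf '== %s ==\n' "$cmd"
        # $cmd is intentionally unquoted so it splits into command + arguments;
        # stdin is redirected so the command cannot eat the command list.
        timeout "$t" $cmd 2>&1 </dev/null || true
    done <<'EOF'
pvecm status
corosync-cfgtool -s
systemctl status corosync pve-cluster --no-pager
journalctl -b -u corosync -u pve-cluster --no-pager -n 100
EOF
}
```

Running `collect_diag > diag-$(hostname).txt` on the joining node and on one existing member would give two files that can be compared or attached here.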
greetings,
chipmonk