This is a follow-up on the large cluster I described in my previous thread.
I spent a solid day on site reinstalling every system and replumbing everything, and got it all running again along with separate individual networks for three clusters of 8-10 nodes each.
Yesterday, I started setting up the first cluster. The first two nodes clustered without issue. When I came to node 3, I realized too late that I'd made an error in my network configuration and the third node could not communicate with the first two. I tried to fix this by manually taking everything offline, fixing corosync.conf, and bringing it all back up, but I could never get the nodes to communicate regardless of what I did.
I attempted to remove the third node and rejoin it via the process documented here. I was able to simply use pvecm delnode to remove the system from the cluster on nodes 1 and 2, and on the errant node I removed /etc/pve/nodes/node3 along with the other files specified in the documentation. After a restart, node 3 came back up and worked OK as a standalone node.
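For anyone wanting to double-check that the delnode step really took effect, here is a rough sketch of how the membership list could be checked. The helper name node_listed is made up for illustration, and it assumes that pvecm nodes prints the node name in the third column, which may differ between versions.

```shell
#!/bin/sh
# node_listed NAME
# Hypothetical helper: reads `pvecm nodes` style output on stdin and
# succeeds (exit 0) if NAME appears as a member, fails (exit 1) otherwise.
# Assumption: the node name is the third whitespace-separated column.
node_listed() {
    awk -v name="$1" '
        $3 == name { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Intended use on one of the remaining nodes (not run here):
#   if pvecm nodes | node_listed pve03; then
#       echo "pve03 is still listed as a member"
#   fi
```

If the removed node still shows up in that list on nodes 1 or 2, the removal did not fully take effect there.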
When I tried to add node 3 back to the cluster, though, it did not work correctly, and the web UI throws a certificate error: Connection error 596 - tls_process_server_certificate: certificate verify failed. The logs make the reason pretty obvious:
Code:
Nov 22 11:09:09 pve01 pveproxy[759416]: '/etc/pve/nodes/pve03/pve-ssl.pem' does not exist!
Sure enough, /etc/pve/nodes/pve03 on the two working nodes does not include a certificate pair, and searching for pve-ssl.pem/key on node 3 does not return any files. Nor is there an /etc/pve/nodes directory on the third node at all.
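The check described above can be repeated on each node with something like the following. This is only a minimal sketch: check_node_certs is a made-up helper name, and the optional directory argument exists purely so it can be tried against a scratch directory instead of the real /etc/pve/nodes.

```shell
#!/bin/sh
# check_node_certs [NODES_DIR]
# Hypothetical helper: lists every node directory that lacks pve-ssl.pem
# or pve-ssl.key. NODES_DIR defaults to /etc/pve/nodes.
# Exit status: 0 = all pairs present, 1 = something missing,
#              2 = the nodes directory itself does not exist.
check_node_certs() {
    nodes_dir="${1:-/etc/pve/nodes}"
    if [ ! -d "$nodes_dir" ]; then
        echo "no nodes directory: $nodes_dir"
        return 2
    fi
    status=0
    for dir in "$nodes_dir"/*/; do
        [ -d "$dir" ] || continue      # glob matched nothing
        node=$(basename "$dir")
        for f in pve-ssl.pem pve-ssl.key; do
            if [ ! -e "$dir$f" ]; then
                echo "$node: missing $f"
                status=1
            fi
        done
    done
    return $status
}

# Intended use on a cluster node (not run here):
#   check_node_certs
```

On nodes 1 and 2 this should report the missing pair for pve03, and on node 3 it should report that the nodes directory does not exist, matching what is described above. Proxmox does ship pvecm updatecerts (with a --force option) for regenerating node certificates, but whether that is enough to recover from this particular state is not something I can confirm.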
I'd very much like to be able to repair this without having to wipe and reinstall all these nodes again. If anyone has any suggestions I'd love to hear them.