Hoping someone can point me in the right direction...
I have a fully functioning 4-node cluster, also running Ceph. All nodes have four 10G interfaces, all connected to the same 10G switch. This is a new cluster with new (to me) hardware, and everything has been working fine for a few weeks now.
I had a previous cluster (with the cluster network on a separate VLAN) on which I virtualized pfSense on Proxmox. pfSense was still on that old cluster, and I'm trying to move it to my new one. Here is what I have done...
Moved pfSense to one of my old servers with a standalone PVE install. Restored the pfSense config and everything works correctly, except the internet is randomly slow to respond, which was never an issue before.
Reinstalled PVE (a fresh install) on the machine that had been running pfSense. So at this point the old cluster is completely gone. As mentioned, pfSense is running on a standalone PVE server that is NOT in the cluster.
I have tried three times to add this new node to the cluster. The install goes fine; I log into the web UI and add the cluster network as a bridge (just like on all the other nodes), then plug in an ethernet cable for the cluster network. From my desktop I can ping the new node's cluster-network IP, so routing is working. As soon as I click the join button on the new node, it gives the normal output and looks like it joined the cluster fine. On the existing nodes I can see the new node listed in the GUI, but with a "red X." Soon after, the entire cluster goes down: none of the GUIs are available (or only with limited availability). As soon as I unplug the CLUSTER network cable from the new node, the cluster comes back to life (sometimes needing all VMs restarted, as if they had been rebooted). If I plug the cluster cable back into the new node, it goes down again.

I have tried two different ethernet ports on the new node in case it was a NIC issue, but it happens on both, and this same machine previously worked fine in my old cluster. All nodes are connected via 10G to the same switch, but the uplink is on 1G while the old server briefly handles pfSense - at least until I can get this new node into the cluster and install pfSense on it. This sure seems to be a network issue (and more specifically routing, since that's all that has changed). Any tips on how I can figure out what might be going on?
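One thing I did try to sanity-check is whether the new node's corosync link address actually landed on the same cluster subnet as the other nodes. On a real node the file to look at is /etc/pve/corosync.conf; below is the quick check I mean, simulated against a placeholder excerpt (the IPs are made up, not my real addressing):

```shell
# Placeholder excerpt standing in for /etc/pve/corosync.conf (IPs invented):
cat > /tmp/corosync-excerpt.conf <<'EOF'
node {
  name: pve2
  ring0_addr: 10.10.10.12
}
node {
  name: pve1
  ring0_addr: 10.10.20.15
}
EOF

# Print the /24 prefix next to each ring0_addr; every node should share
# the same prefix if they are all on the one cluster subnet:
awk '/ring0_addr/ {split($2, a, "."); print a[1]"."a[2]"."a[3], $2}' /tmp/corosync-excerpt.conf
```

In this made-up excerpt, pve1 would stand out with a different prefix (10.10.20 vs 10.10.10), which is the kind of mismatch I was looking for after the VLAN change.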
pvecm status shows all nodes (including the new node), but it obviously shows the new node as offline since it's unplugged.
I'm considering installing pfSense bare metal, but I was running this same machine in my previous cluster without issue and just can't figure out why it's taking down the entire cluster. All machines, including the new node, are running the latest PVE with all no-subscription updates. Each time I tried to add the new node I used the same hostname (pve1), but before each retry I ran pvecm delnode to remove it from the cluster and deleted 'pve1' from /etc/pve/nodes/
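For reference, the cleanup I run before each re-join attempt looks roughly like this (the first two commands only work against a live cluster, so they're shown as comments; the stale-entry check is simulated against a placeholder file rather than the real /etc/pve/corosync.conf):

```shell
# Run on a surviving cluster node, after the failed node is disconnected:
#   pvecm delnode pve1          # remove pve1 from the corosync membership
#   rm -rf /etc/pve/nodes/pve1  # drop its leftover config from the cluster filesystem

# Then I check that no stale pve1 entry survived in corosync.conf.
# Placeholder excerpt standing in for /etc/pve/corosync.conf:
cat > /tmp/corosync-demo.conf <<'EOF'
nodelist {
  node {
    name: pve2
    nodeid: 2
  }
}
EOF
if grep -q 'name: pve1' /tmp/corosync-demo.conf; then
  echo "stale pve1 entry - remove it before re-joining"
else
  echo "no stale pve1 entry"
fi
```

My assumption was that a leftover pve1 entry (or a nodeid clash from reusing the hostname) could confuse corosync on re-join, which is why I kept cleaning it out each time.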
Where should I begin to look for what is taking down corosync? Thanks in advance!