We have just had a serious issue with our cluster.
We have 7 nodes in total, 4 of which also run Ceph, with around 400 VMs running. We were in the process of adding an 8th node, and after adding it to the cluster everything started to lock up.
Upon investigation, it appeared that every single Proxmox node (apart from the new one) had rebooted. After the reboot, the cluster was not fully up and there were loads of errors on the console; only when we disconnected the new node from the network did everything spring back to life.
We then took a closer look at the networking config on node 8. We had made a mistake with the VLAN assignments. Cluster networks 0 and 1 had the wrong VLAN on them, so they could not communicate with the other nodes. Cluster network 2 was correct, and that was the IP address we used to join the cluster with.
I appreciate that this was an error on our part, but how on earth can a fat-finger mistake like this cause the entire cluster to fall on its arse? Surely, as the other nodes had a clear majority, node 8 would just be marked as offline?
We are in the process of rebuilding node 8 and will hopefully be able to join it without issue this time.
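Before rejoining, we want to confirm that every cluster network on node 8 can actually reach the existing nodes, not just the one we join on. Below is a rough sketch of the kind of pre-join connectivity check that would have caught the VLAN mistake; the link numbers and addresses are placeholders for illustration, not our real cluster networks.

```python
#!/usr/bin/env python3
"""Pre-join sanity check: from the new node, ping every existing node on
every cluster network (corosync link) before attempting to join.

The link numbers and IP addresses below are placeholders only -- replace
them with your own cluster network layout."""

import subprocess

# Hypothetical layout: link number -> peer addresses on that network.
PEERS_BY_LINK = {
    0: ["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # cluster network 0
    1: ["10.0.1.1", "10.0.1.2", "10.0.1.3"],  # cluster network 1
    2: ["10.0.2.1", "10.0.2.2", "10.0.2.3"],  # cluster network 2
}


def reachable(addr: str) -> bool:
    """Return True if a single ping to addr succeeds within 2 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def main() -> None:
    all_ok = True
    for link, peers in sorted(PEERS_BY_LINK.items()):
        for addr in peers:
            ok = reachable(addr)
            all_ok = all_ok and ok
            print(f"link {link}  {addr:15s}  {'OK' if ok else 'UNREACHABLE'}")
    if not all_ok:
        raise SystemExit("At least one cluster network is unreachable -- fix VLANs before joining.")
    print("All peers reachable on all cluster networks.")


if __name__ == "__main__":
    main()
```

In our case this would have flagged networks 0 and 1 as unreachable straight away, even though the join address on network 2 worked.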