Early days, and I can't make heads or tails of it, but we have a cluster of three nodes with HA and replication happily chugging along. When we came to add a new node, all hell broke loose: all three existing nodes restarted, red Xs appeared on the nodes, and we couldn't get in via the web interface. All VMs were abruptly stopped, causing corruption and general chaos.
The new node was at a satellite site. The main and satellite sites are part of the same organisational unit, so we wanted to add it so that the whole unit could be administered and backed up the same way. The remote site is on a different subnet and has a small amount of latency.
The problem seemed identical to the one in this post:
https://forum.proxmox.com/threads/cluster-issues-after-adding-node.89091/
That post doesn't really go anywhere: it says things broke, something went wrong, and then they removed and re-added the node and it worked.
Well, we removed the node and the original three are now happy again, and I don't fancy re-adding it until I know what happened.
But... what happened?
The internet thinks it's a corosync confusion issue, but by now that shouldn't be enough to blow up the entire cluster. To recover, we stopped the cluster service, then corosync, and then forced the cluster filesystem into local mode.
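For reference, that recovery sequence corresponds roughly to the standard Proxmox steps, run as root on the node we could still reach (a sketch, not an exact transcript of what was typed):

systemctl stop pve-cluster   # stop the pve-cluster service that runs pmxcfs
systemctl stop corosync      # stop cluster membership / quorum
pmxcfs -l                    # restart the cluster filesystem in local mode, so /etc/pve is usable without quorum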
We were lucky that we could still get onto one of the nodes, because the site is a good distance away. It makes sense to me that we should be able to add all parts of an OU to the same cluster and manage them under one pane of glass; you can group servers and keep replication within the nodes that are geographically related.
Where's the best place to look for a forensic breakdown of what actually happened?