Hey everyone,
We recently had a network failure in one of our data centers, which caused all of the Proxmox nodes in our cluster to fence themselves. They're back up and running, and the cluster shows all nodes as members, but we're having the following issues:
1. HA no longer works. Containers managed by HA can't be started; to start them, we have to remove them from HA.
2. We can't add new nodes to the cluster. When I try to add a new node, I get this output:
pvecm add 10.3.16.20
root@10.3.16.20's password:
copy corosync auth key
stopping pve-cluster service
backup old database
Job for corosync.service failed. See 'systemctl status corosync.service' and 'journalctl -xn' for details.
waiting for quorum...
And then it hangs.
Any help would be greatly appreciated.