We are currently running tests on a new three-node HA cluster configured with 2x10G networking and shared (Ceph) storage. We are using the latest stable packages for Proxmox 5.3.
The behaviour around server reboots is nowhere near as smooth as we would like. These are the issues we see:
1) Doing a controlled shutdown of a node (from the web interface) seems to shut down the HA-managed containers and VMs fine, but leaves them in a 'started' state, meaning we have to wait for fencing to kick in before they are migrated away and restarted. This causes unnecessary downtime unless we manually migrate all services away before rebooting the node.
2) When shutting down a node, the corresponding node lock in /etc/pve/priv/lock/ is (at first) released as part of the shutdown procedure, but as soon as the node is fenced (see above), the lock is re-acquired by one of the other nodes. Then, when the original node comes back up, HA resources with a preference for that node (via HA groups) are immediately migrated back. In practice this means the services in question are stopped on the failover nodes, and the returning node spends two minutes waiting for a lock timeout until it logs 'successfully acquired lock' and the services are started again.
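In case the configuration matters: our groups.cfg and resources.cfg under /etc/pve/ha/ look roughly like the sketch below (node names, priorities and the VMID are placeholders, but the structure is the same):

group: prefer-node1
        nodes node1:2,node2:1,node3:1
        nofailback 0

vm: 100
        group prefer-node1
        state started

So failback to the preferred node is intentional; it is the stop / two-minute lock wait / start cycle on failback that we did not expect.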
Any idea why this is happening? Any logs or observations that can offer some insight?
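If it helps, we can pull the HA manager logs from all three nodes around the time of the reboot and post the relevant output here, e.g.:

journalctl -u pve-ha-crm -u pve-ha-lrm
ha-manager status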
Any help is appreciated.