HA: CT was migrated to the wrong node on node restart, and it shouldn't have been migrated at all.

templar

Member
Oct 25, 2021
Short: on node restart, HA started migrating a CT to another node, one it wasn't supposed to move to.
Long: I have a small cluster of 5 nodes. I experimented a little with HA and then disabled the HA node affinity rule for later. Today I decided to restart the 3rd node because of a kernel upgrade. That node had 3 CTs, one of which was in the HA rules. This CT was set up to replicate to nodes 4 and 5, and the same nodes were set up for HA. But as the node started to restart, HA tried to migrate the CT to node 1 (not 4 or 5), where it failed to start because it had never been replicated there. I solved it by simply moving the conf back to the 3rd node after the restart, so it is not a problem for me now, but I think it could be important to investigate so that it doesn't happen again when it matters.

Data: Proxmox v9.1.1
Nodes: 1: mini1, 2: mini2, 3: oldie1, 4: tiny1, 5: tiny2.
CT number 351, hosted on oldie1.
Cluster options, ha: shutdown_policy=failover
HA node affinity rule (disabled!):
1764641105757.png
300 and 908 are on another node; I wanted to adapt HA later.

HA Resource properties:
1764641432765.png

Replication rules:
1764641746550.png

Restart logs:
1764641649997.png

I've also attached the system log from oldie1 during the reboot, in case it's helpful, and I can provide further logs if needed.

My solution (worked): mv /etc/pve/nodes/mini1/lxc/351.conf /etc/pve/nodes/oldie1/lxc/
 

If your HA affinity rule was disabled, you didn't tell PVE that the CT can only run on those three nodes, so it assumed that failing over to node mini1 was okay.
 
Do I understand you right that it would work if the CT had its volumes on network drives only? In that case, shouldn't replication be taken into consideration? I mean, if there are replication rules, then it's obvious that the CT has volumes on local drives, which are also needed to run the CT, aren't they? Or it might even be worth checking whether the CT's configuration contains volumes on the node's local drives.
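The kind of check suggested here could look roughly like the Python sketch below. It is only an illustration, not something PVE is known to do: the sample config text, the storage names and the `SHARED_STORAGES` set are made up for the example (the real 351.conf was not posted), and only the `rootfs:` / `mpN:` key layout follows the usual container config format.

```python
import re

# Hypothetical sample of a container config; not the real 351.conf.
SAMPLE_CONF = """\
arch: amd64
hostname: example
rootfs: local-zfs:subvol-351-disk-0,size=8G
mp0: nfs-share:351/vm-351-disk-1.raw,mp=/data,size=16G
net0: name=eth0,bridge=vmbr0
"""

# Assumption: names of storages that are reachable from every node.
SHARED_STORAGES = {"nfs-share"}

# Config keys that describe container volumes: rootfs, mp0, mp1, ...
VOLUME_KEY = re.compile(r"^(rootfs|mp\d+)$")


def local_volumes(conf_text, shared_storages):
    """Return (key, volume) pairs whose storage is not in shared_storages."""
    found = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or not VOLUME_KEY.match(key):
            continue
        # The volume is the first comma-separated field of the value.
        volume = value.strip().split(",", 1)[0]
        if volume.startswith("/"):
            # A bind mount of a host path is always node-local.
            found.append((key, volume))
            continue
        storage = volume.split(":", 1)[0]
        if storage not in shared_storages:
            found.append((key, volume))
    return found


print(local_volumes(SAMPLE_CONF, SHARED_STORAGES))
```

With the made-up sample, only the `rootfs` volume on `local-zfs` is reported, which is the kind of volume that would be missing on a node it was never replicated to.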
 
HA only takes the HA settings (including the scheduler) into consideration; it doesn't know or care about replication. For "real" HA you need shared storage, but even then you need to restrict your HA resources if that storage is not available everywhere. The same is true for "pauper's" HA using ZFS-based replication: you are responsible for setting HA up in a compatible fashion.
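To make "setting HA up in a compatible fashion" a bit more concrete, here is a small Python sketch using the node names from this thread. It is a toy model, not PVE's code: the function and variable names are invented, the rule is assumed to list oldie1, tiny1 and tiny2, and the sketch only compares the nodes HA may pick with the nodes that hold a copy of the CT's volumes.

```python
ALL_NODES = {"mini1", "mini2", "oldie1", "tiny1", "tiny2"}

# CT 351 lives on oldie1 and is replicated to tiny1 and tiny2 (from the thread).
NODES_WITH_DATA = {"oldie1", "tiny1", "tiny2"}

# Assumed content of the (disabled) node affinity rule.
RULE_NODES = {"oldie1", "tiny1", "tiny2"}


def ha_candidates(all_nodes, rule_nodes, rule_enabled):
    """Nodes HA may choose in this toy model.

    A disabled (or empty) node affinity rule places no restriction,
    so every cluster node is a candidate.
    """
    if rule_enabled and rule_nodes:
        return set(all_nodes) & set(rule_nodes)
    return set(all_nodes)


def unsafe_targets(candidates, nodes_with_data):
    """Candidates on which the CT could not start for lack of its volumes."""
    return sorted(set(candidates) - set(nodes_with_data))


# Rule disabled: nodes without a replica are still possible targets.
print(unsafe_targets(ha_candidates(ALL_NODES, RULE_NODES, False), NODES_WITH_DATA))

# Rule enabled: every candidate also holds the data.
print(unsafe_targets(ha_candidates(ALL_NODES, RULE_NODES, True), NODES_WITH_DATA))
```

The model treats an enabled rule as a hard restriction. If a rule is only a preference and allows falling back to other nodes, the nodes without a replica would presumably become candidates again once the preferred ones are unavailable.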