HA affinity problem (?)

FrancisS

Well-Known Member
Apr 26, 2019
Hello,

I have a problem with HA affinity on PVE 9.1.2. I have two VMs with the resource affinity "Keep Together" and node affinity "pve1".

When I put "pve1" into maintenance mode, one VM migrates to "pve2" and the other to "pve3", so they do not end up on the same node.

Best regards.
Francis
 
Hi Francis,

It's working as intended here with 9.1.2.
Both guests are migrated together to node2 when node1 is set to maintenance.

Here are my rules:

Code:
root@node1:~# cat /etc/pve/ha/rules.cfg
resource-affinity: ha-rule-91edaaa7-807b
        affinity positive
        resources vm:100,vm:102

node-affinity: ha-rule-e9b7994a-ecf1
        nodes node1
        resources vm:100,vm:102
        strict 0

If I disable my "Keep Together" rule then I get your behavior with migration to different nodes.

Is rule "enabled"?
Are correct vm ids in rule?

Are all referenced vm ids still HA resources?
You can create a rule and then delete one of HA resources that are referenced by the rule. That will "break" this rule. Re adding the deleted HA resource will not repair the affinity rule. The rule will stay "broken" with a warning sign in Web GUI.
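
If it helps, this is roughly how I cross-check that on the CLI (just a sketch with the standard tools, adjust to your setup):

Code:
# lists the rules with their state ("in use") and the referenced resources
ha-manager rules config

# shows which VM IDs are currently managed as HA resources
ha-manager status | grep '^service'

# the raw config files behind it
cat /etc/pve/ha/rules.cfg
cat /etc/pve/ha/resources.cfg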

BR
Marcus
 
Hello Marcus,

Thank you, I have this:

node-affinity: ha-rule-57390649-e233
nodes pve1
resources vm:102,vm:104,vm:110
strict 0

resource-affinity: ha-rule-b994d330-5ea0
affinity positive
resources vm:104,vm:110

Best regards.
Francis
 
Marcus,

# ha-manager status
quorum OK
master pve1 (active, Wed Dec 17 14:28:02 2025)
lrm pve1 (active, Wed Dec 17 14:28:03 2025)
lrm pve2 (active, Wed Dec 17 14:28:03 2025)
lrm pve3 (active, Wed Dec 17 14:28:01 2025)
service vm:102 (pve1, started)
service vm:104 (pve1, started)
service vm:110 (pve1, started)

# ha-mnt-on (script)

# ha-manager status
quorum OK
master pve1 (active, Wed Dec 17 14:32:33 2025)
lrm pve1 (maintenance mode, Wed Dec 17 14:32:31 2025)
lrm pve2 (active, Wed Dec 17 14:32:32 2025)
lrm pve3 (active, Wed Dec 17 14:32:31 2025)
service vm:102 (pve1, migrate)
service vm:104 (pve3, starting)
service vm:110 (pve2, started)

I have more VMs than shown here.

Best regards.
Francis
 
Hi Francis,

It is still working as expected with 3 guests here on my side.

I have a node affinity rule for "node1" with IDs 100, 101 and 102, and two of them (101 and 102) are in a positive resource affinity rule. The config looks like yours, just with different IDs and hostnames.
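
Roughly like this (with placeholder rule IDs):

Code:
node-affinity: ha-rule-xxxxxxxx-node1
        nodes node1
        resources vm:100,vm:101,vm:102
        strict 0

resource-affinity: ha-rule-xxxxxxxx-keep
        affinity positive
        resources vm:101,vm:102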

All 3 guests running on node1
node1 maintenance mode enabled
migration actions:
vm100 -> node2
vm101+102 -> node3

So something strange is going on with your system.

BR
Marcus
 
Hi Marcus,

Maybe you just got "lucky" that vm101+102 migrated to node3?

Is there a way to debug HA "affinity"?

Best regards.
Francis
 
Hi!

I have a problem with HA affinity on PVE 9.1.2. I have two VMs with the resource affinity "Keep Together" and node affinity "pve1".

When I put "pve1" into maintenance mode, one VM migrates to "pve2" and the other to "pve3", so they do not end up on the same node.
I have recreated your exact setup as described by the status output and rules config above and couldn't reproduce this either.

What should happen is that, as soon as pve1 is put into maintenance mode, vm:102 will select a new node (pve2 here, as it's empty), vm:104 will select another node (pve3, as it's also empty and the HA CRM goes in a round-robin next-fit fashion with the basic scheduler), and vm:110 will follow vm:104 to pve3, as these two are in a positive resource affinity rule.

As the node affinity rule is non-strict, it will fall back to {pve2, pve3} as the possible nodes for all three. If it were strict, all HA resources would stay on pve1 even though pve1 is in maintenance mode.
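
As a side note, which scheduler the CRM uses is the "Cluster Resource Scheduling" datacenter option; a rough way to check it on the CLI (if there is no crs line, the default basic scheduler is in use):

Code:
# prints e.g. "crs: ha=static" if a non-default scheduler is configured
grep ^crs /etc/pve/datacenter.cfg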

Can you post the output of journalctl -u pve-ha-crm for the exact situation you posted above? That would help in investigating this issue. Here's a reference of what happens in my test setup:

Code:
adding new service 'vm:102' on node 'pve1'
adding new service 'vm:104' on node 'pve1'
adding new service 'vm:110' on node 'pve1'
service 'vm:102': state changed from 'request_start' to 'started'  (node = pve1)
service 'vm:104': state changed from 'request_start' to 'started'  (node = pve1)
service 'vm:110': state changed from 'request_start' to 'started'  (node = pve1)
status change wait_for_quorum => slave
status change wait_for_quorum => slave
node 'pve1': state changed from 'online' => 'maintenance'
migrate service 'vm:102' to node 'pve2' (running)
service 'vm:102': state changed from 'started' to 'migrate'  (node = pve1, target = pve2)
migrate service 'vm:104' to node 'pve3' (running)
service 'vm:104': state changed from 'started' to 'migrate'  (node = pve1, target = pve3)
migrate service 'vm:110' to node 'pve3' (running)
service 'vm:110': state changed from 'started' to 'migrate'  (node = pve1, target = pve3)
service 'vm:102': state changed from 'migrate' to 'started'  (node = pve2)
service 'vm:104': state changed from 'migrate' to 'started'  (node = pve3)
service 'vm:110': state changed from 'migrate' to 'started'  (node = pve3)
 
Hi Daniel,

Thank you.

Sorry, I removed some lines from the logs; I have more than 3 VMs and the nodes (these are not the real names) are not empty.

I am busy; I will redo the test as soon as possible.

Best regards
Francis
 
Thanks!

If it's possible, it would be great to have a more complete reproducer for this to investigate the issue. The names can be changed; the only important part is that the changed names have the same alphabetical ordering (e.g. SN140 -> pve2, PVE003 -> pve1).
 
Hi Daniel,

Options "Cluster Resource Scheduling" = Default

HA status with all VMs (node names changed).
# ha-manager status
quorum OK
master pve1 (active, Thu Dec 18 12:22:25 2025)
lrm pve1 (active, Thu Dec 18 12:22:19 2025)
lrm pve2 (active, Thu Dec 18 12:22:17 2025)
lrm pve3 (active, Thu Dec 18 12:22:22 2025)
service vm:100 (pve3, started)
service vm:101 (pve2, started)
service vm:102 (pve1, started)
service vm:104 (pve1, started)
service vm:106 (pve2, started)
service vm:108 (pve3, started)
service vm:110 (pve1, started)
service vm:112 (pve3, started)
service vm:113 (pve2, started)

# journalctl -u pve-ha-crm | grep "Dec 18"
Dec 18 12:03:23 pve1 pve-ha-crm[2543]: node 'pve1': state changed from 'online' => 'maintenance'
Dec 18 12:03:23 pve1 pve-ha-crm[2543]: migrate service 'vm:102' to node 'pve2' (running)
Dec 18 12:03:23 pve1 pve-ha-crm[2543]: service 'vm:102': state changed from 'started' to 'migrate' (node = pve1, target = pve2)
Dec 18 12:03:23 pve1 pve-ha-crm[2543]: migrate service 'vm:104' to node 'pve3' (running)
Dec 18 12:03:23 pve1 pve-ha-crm[2543]: service 'vm:104': state changed from 'started' to 'migrate' (node = pve1, target = pve3)
Dec 18 12:03:23 pve1 pve-ha-crm[2543]: migrate service 'vm:110' to node 'pve2' (running)
Dec 18 12:03:23 pve1 pve-ha-crm[2543]: service 'vm:110': state changed from 'started' to 'migrate' (node = pve1, target = pve2)
Dec 18 12:04:04 pve1 pve-ha-crm[2543]: service 'vm:110': state changed from 'migrate' to 'started' (node = pve2)
Dec 18 12:07:04 pve1 pve-ha-crm[2543]: service 'vm:104': state changed from 'migrate' to 'started' (node = pve3)
Dec 18 12:07:44 pve1 pve-ha-crm[2543]: service 'vm:102': state changed from 'migrate' to 'started' (node = pve2)
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: node 'pve1': state changed from 'maintenance' => 'online'
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: moving service 'vm:102' back to 'pve1', node came back from maintenance.
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: migrate service 'vm:102' to node 'pve1' (running)
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: service 'vm:102': state changed from 'started' to 'migrate' (node = pve2, target = pve1)
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: moving service 'vm:104' back to 'pve1', node came back from maintenance.
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: migrate service 'vm:104' to node 'pve1' (running)
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: service 'vm:104': state changed from 'started' to 'migrate' (node = pve3, target = pve1)
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: moving service 'vm:110' back to 'pve1', node came back from maintenance.
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: migrate service 'vm:110' to node 'pve1' (running)
Dec 18 12:15:54 pve1 pve-ha-crm[2543]: service 'vm:110': state changed from 'started' to 'migrate' (node = pve2, target = pve1)
Dec 18 12:16:35 pve1 pve-ha-crm[2543]: service 'vm:110': state changed from 'migrate' to 'started' (node = pve1)
Dec 18 12:19:25 pve1 pve-ha-crm[2543]: service 'vm:104': state changed from 'migrate' to 'started' (node = pve1)
Dec 18 12:20:15 pve1 pve-ha-crm[2543]: service 'vm:102': state changed from 'migrate' to 'started' (node = pve1)

Best regards.
Francis
 
Thanks!

If it's possible, it would be great to have a more complete reproducer for this to investigate the issue. The names can be changed; the only important part is that the changed names have the same alphabetical ordering (e.g. SN140 -> pve2, PVE003 -> pve1).
I only changed the node names: xxxx1 to pve1, xxxx2 to pve2, etc.
 
Hi Daniel,

Options "Cluster Resource Scheduling" = Default

HA status with all VMs (node names changed).
If the HA status report from above is the exact same as it was before setting pve1 in maintenance mode (the HA status report's timestamp is later than the syslog), then I cannot reproduce it with this configuration either. Are you sure that the rules are complete/not in conflict with other rules?

What is the output of ha-manager rules config? It should show both of the rules as "in use".
 
On another note, what pve-ha-manager version is running on the nodes?
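
E.g. something like this on each node should do (just a sketch):

Code:
dpkg -l pve-ha-manager | grep ^ii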
 
Daniel,

The command "Ha status" after the test sorry.

# ha-manager rules config
┌─────────┬────────┬───────────────────────┬───────────────────┬──────────────────────┬─────────┐
│ enabled │ state │ rule │ type │ resources │ comment │
╞═════════╪════════╪═══════════════════════╪═══════════════════╪══════════════════════╪═════════╡
│ 1 │ in use │ ha-rule-442d4139-ce6b │ resource-affinity │ vm:106,vm:108 │ │
├─────────┼────────┼───────────────────────┼───────────────────┼──────────────────────┼─────────┤
│ 1 │ in use │ ha-rule-e7796eed-ceac │ node-affinity │ vm:101,vm:113 │ │
├─────────┼────────┼───────────────────────┼───────────────────┼──────────────────────┼─────────┤
│ 1 │ in use │ ha-rule-4e83bf8a-7fcf │ node-affinity │ vm:100,vm:112 │ │
├─────────┼────────┼───────────────────────┼───────────────────┼──────────────────────┼─────────┤
│ 1 │ in use │ ha-rule-57390649-e233 │ node-affinity │ vm:102,vm:104,vm:110 │ │
├─────────┼────────┼───────────────────────┼───────────────────┼──────────────────────┼─────────┤
│ 1 │ in use │ ha-rule-b994d330-5ea0 │ resource-affinity │ vm:104,vm:110 │ │
├─────────┼────────┼───────────────────────┼───────────────────┼──────────────────────┼─────────┤
│ 1 │ in use │ ha-rule-b60a6657-5556 │ node-affinity │ vm:106 │ │
├─────────┼────────┼───────────────────────┼───────────────────┼──────────────────────┼─────────┤
│ 1 │ in use │ ha-rule-c18893ae-450f │ node-affinity │ vm:108 │ │
└─────────┴────────┴───────────────────────┴───────────────────┴──────────────────────┴─────────┘

# for i in 1 2 3 ; do ssh pve$i date; done
Thu Dec 18 02:24:38 PM CET 2025
Thu Dec 18 02:24:38 PM CET 2025
Thu Dec 18 02:24:39 PM CET 2025

# for i in 1 2 3 ; do ssh pve$i dpkg -l pve-ha-manager; done | grep pve-ha-manager
ii pve-ha-manager 5.0.8 amd64 Proxmox VE HA Manager
ii pve-ha-manager 5.0.8 amd64 Proxmox VE HA Manager
ii pve-ha-manager 5.0.8 amd64 Proxmox VE HA Manager
 
I'm sorry, but I cannot reproduce this issue with the information you gave me. There's one newer version of pve-ha-manager on trixie/pve-test (5.1.0), but the bug fix in that version shouldn't be related to this here.

Can you share the resources.cfg and rules.cfg for HA? If possible, a reinstall of the packages and/or a reboot is always a good way to rule out anything wrong with the source files themselves.
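
Something along these lines would do for a backup and the reinstall (just a sketch, adjust the paths to taste):

Code:
# keep a copy of the current HA config before changing anything
cp /etc/pve/ha/resources.cfg /root/ha-resources.cfg.bak
cp /etc/pve/ha/rules.cfg /root/ha-rules.cfg.bak

# reinstall the HA manager package (run on every node)
apt install --reinstall pve-ha-manager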
 
Hello Daniel,

Shouldn't the file resources.cfg normally have been updated? It is still from Nov 12.

# ls -l /etc/pve/ha/rules.cfg /etc/pve/ha/resources.cfg
-rw-r----- 1 root www-data 384 Nov 12 08:09 /etc/pve/ha/resources.cfg
-rw-r----- 1 root www-data 605 Dec 11 11:25 /etc/pve/ha/rules.cfg

# cat /etc/pve/ha/rules.cfg
resource-affinity: ha-rule-442d4139-ce6b
affinity negative
resources vm:106,vm:108

node-affinity: ha-rule-e7796eed-ceac
nodes pve2
resources vm:101,vm:113
strict 0

node-affinity: ha-rule-4e83bf8a-7fcf
nodes pve3
resources vm:100,vm:112
strict 0

node-affinity: ha-rule-57390649-e233
nodes pve1
resources vm:102,vm:104,vm:110
strict 0

resource-affinity: ha-rule-b994d330-5ea0
affinity positive
resources vm:104,vm:110

node-affinity: ha-rule-b60a6657-5556
nodes pve2
resources vm:106
strict 0

node-affinity: ha-rule-c18893ae-450f
nodes pve3
resources vm:108
strict 0

The resources.cfg seems to be an old file? With 9.x I no longer have the pve*, server1 and server2 groups.

# cat /etc/pve/ha/resources.cfg
vm: 101
group pve2
state started

vm: 113
group pve2
max_relocate 3
max_restart 3
state started

vm: 104
group pve1
state started

vm: 100
group pve3
state started

vm: 110
group pve1
state started

vm: 112
group pve3
state started

vm: 102
group pve1
state started

vm: 106
group server1
state started

vm: 108
group server2
state started`

The package "pve-ha-manager" reinstalled on all nodes same problem and no resources.cfg updates

# ls -l /etc/pve/ha/resources.cfg
-rw-r----- 1 root www-data 384 Nov 12 08:09 /etc/pve/ha/resources.cfg

# ha-manager status
quorum OK
master pve1 (active, Fri Dec 19 08:17:04 2025)
lrm pve1 (active, Fri Dec 19 08:17:07 2025)
lrm pve2 (active, Fri Dec 19 08:17:04 2025)
lrm pve3 (active, Fri Dec 19 08:17:04 2025)
service vm:100 (pve3, started)
service vm:101 (pve2, started)
service vm:102 (pve1, started)
service vm:104 (pve1, started)
service vm:106 (pve2, started)
service vm:108 (pve3, started)
service vm:110 (pve1, started)
service vm:112 (pve3, started)
service vm:113 (pve2, started)
# ha-mnt-on
# ha-manager status
quorum OK
master pve1 (active, Fri Dec 19 08:22:55 2025)
lrm pve1 (maintenance mode, Fri Dec 19 08:22:52 2025)
lrm pve2 (active, Fri Dec 19 08:22:55 2025)
lrm pve3 (active, Fri Dec 19 08:22:55 2025)
service vm:100 (pve3, started)
service vm:101 (pve2, started)
service vm:102 (pve2, started)
service vm:104 (pve3, started)
service vm:106 (pve2, started)
service vm:108 (pve3, started)
service vm:110 (pve2, started)
service vm:112 (pve3, started)
service vm:113 (pve2, started)

I get the same problem for all nodes:

# ha-mnt-on
# reboot
# ha-mnt-off

Best regards.
Francis
 
Daniel,

I think that the cluster is not in a good state... can I remove the 'resources.cfg' file and restart HA?

Best regards.
Francis
 
Daniel,

I removed the "resources.cfg" and reconfigured HA (no rule changes):

# cat resources.cfg
vm: 100
state started

vm: 101
state started

vm: 102
state started

vm: 104
state started

vm: 106
state started

vm: 108
state started

vm: 110
state started

vm: 112
state started

vm: 113
state started

Now there is no affinity problem; the VMs with "positive" affinity are on the same node, pve3.

# ha-manager status
quorum OK
master pve2 (active, Fri Dec 19 09:14:47 2025)
lrm pve1 (maintenance mode, Fri Dec 19 09:14:52 2025)
lrm pve2 (active, Fri Dec 19 09:14:51 2025)
lrm pve3 (active, Fri Dec 19 09:14:50 2025)
service vm:100 (pve3, started)
service vm:101 (pve2, started)
service vm:102 (pve2, started)
service vm:104 (pve3, started)
service vm:106 (pve2, started)
service vm:108 (pve3, started)
service vm:110 (pve3, started)
service vm:112 (pve3, started)
service vm:113 (pve2, started)

Thank you for your help/time

Best regards.
Francis
 
Now there is no affinity problem; the VMs with "positive" affinity are on the same node, pve3.
From the old resources.cfg from Nov 12, it seems that the HA groups were never fully migrated. Was the ` at the end an artifact of embedding it as code in the forum, or was that part of the file?

Either way, great to hear that your problem has been solved!
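
For anyone else landing here: a quick way to spot leftover group entries from a pre-9.x setup is something like this (sketch):

Code:
# any hit means old HA group references are still present in the resources config
grep -n 'group' /etc/pve/ha/resources.cfg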
 
From the old resources.cfg from Nov 12, it seems that the HA groups were never fully migrated. Was the ` at the end an artifact of embedding it as code in the forum, or was that part of the file?
It is not part of the file