VMs in a positive resource affinity rule migrate to the target node and then back to the source node

christian.g

Well-Known Member
Jun 4, 2020
Hello,

We upgraded our cluster from Proxmox VE 8 to 9 and created a few resource affinity rules (no node affinity rules are present).

For example, there is one positive affinity rule for VMs 910 and 911.

1757668302858.png

Both are simple Linux VMs on Ceph-backed storage (SSD pool).
Neither has PCI devices or anything else unusual configured.
The configuration is the same for both VMs.

1757668578985.png

If we migrate 910 from node 1 to node 3, 911 starts migrating too (as expected).

Code:
Requesting HA migration for VM 910 to node hvprx03
also migrate resource 'vm:911' in positive affinity with resource 'vm:910' to target node 'hvprx03'
TASK OK

However, as soon as 910 has arrived on node 3 while 911 is still migrating, 910 starts migrating back to node 1. As a result, 911 is also migrated back to node 1 as soon as its migration to node 3 has finished.
No errors are reported.

1757669571575.png

Any idea what's going on?
 

Hi!

I couldn't fully reproduce the issue yet, but from what I can see there is a small difference in migration times between VM 910 (25 seconds) and VM 911 (45 seconds). This might cause the HA Manager to detect a separation between those guests in the meantime and request that VM 910 be put back where VM 911 was (node 1): 910 is migrated back to hvprx01 at 10:54:52, right before VM 911 finishes its migration at 10:55:05.

That's incorrect even for the strict case (resources MUST be together on a node). While a resource is migrating, the HA Manager should treat the target node, not the origin node, as the node where the HA resource is located for positive resource affinity rules. I've opened a Bugzilla entry [0] to track this.

[0] https://bugzilla.proxmox.com/show_bug.cgi?id=6801
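
To illustrate the difference between the two placement choices, here is a minimal Python sketch of the situation at the moment VM 910 has arrived and VM 911 is still in flight. It is not the HA Manager's actual code; the `Resource` class, `effective_node`, `separated`, and the `use_target` switch are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    # Hypothetical model of an HA resource, for illustration only.
    sid: str
    node: str                     # node the resource is currently recorded on
    target: Optional[str] = None  # set while a migration is in flight


def effective_node(res: Resource, use_target: bool) -> str:
    """Node that the affinity check treats as the resource's location."""
    if res.target is not None and use_target:
        return res.target
    return res.node


def separated(group: List[Resource], use_target: bool) -> bool:
    """True if a positive affinity group appears split across nodes."""
    return len({effective_node(r, use_target) for r in group}) > 1


# Snapshot from this thread: vm:910 is already on hvprx03,
# vm:911 is still migrating from hvprx01 to hvprx03.
vm910 = Resource("vm:910", node="hvprx03")
vm911 = Resource("vm:911", node="hvprx01", target="hvprx03")

origin_view = separated([vm910, vm911], use_target=False)
target_view = separated([vm910, vm911], use_target=True)
print(origin_view, target_view)  # True False
```

With the origin node as the location, the group looks separated and a corrective migration would be requested; with the target node, the group is considered together and nothing needs to happen.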
 
Thanks for opening a bug report.

Well, yes, VMs may need different amounts of time to migrate. That's completely legitimate.

Consider an application server VM and a corresponding database server VM that you want on the same node. The database server VM may take minutes longer to migrate, since its RAM is usually much larger and gets dirtied frequently. This is the case on our cluster and is what made the behavior visible to me in the first place: the database server VM becomes slow during migration, and users started complaining that it had been slow for a long time (> 15 minutes). That made me look at the current state, and I observed that the application server VM had been migrated back to the source node and the database VM was now migrating back too.

From my point of view, the HA Manager should never evaluate affinity rules for the VMs in an affinity rule group as long as one of them is in the migration state.
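
As a rough sketch of what I mean (plain Python, not taken from the HA Manager; the function name and the state names in `IN_FLIGHT` are placeholders I chose for illustration):

```python
from typing import Dict

# Placeholder names for "resource is currently being moved" states.
IN_FLIGHT = {"migrate", "relocate"}


def affinity_check_allowed(group_states: Dict[str, str]) -> bool:
    """Only evaluate the group's affinity rule when no member is in flight."""
    return not any(state in IN_FLIGHT for state in group_states.values())


during = affinity_check_allowed({"vm:910": "started", "vm:911": "migrate"})
after = affinity_check_allowed({"vm:910": "started", "vm:911": "started"})
print(during, after)  # False True
```

While 911 is still migrating, the rule would simply not be evaluated, so 910 would stay on the target node until both have arrived.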
 