PCI-E Passthrough HA Maintenance

Jun 28, 2023
Hello guys,
we know Proxmox can't live-migrate VMs with PCIe passthrough devices. That's not the point. We've played around with HA and maintenance mode.
We've got 2 questions and 1-2 bugs here.

Case
Let's say we've got 3 servers with 3 graphics cards each, all the same model. We configured PCIe resource mappings for all of these cards with the same ID.
We've got 6 VMs. Every VM has a single dedicated card of its own.
We created an HA group that includes these servers, with state enabled.

Question1
When we enable maintenance mode for a server while VMs with GPU passthrough are running on that node, Proxmox tries to migrate the VMs. The migration is aborted because live migration is not possible. Why does Proxmox try to migrate at all? It can't succeed.
Can we change this behaviour so that the VMs are shut down, migrated, and started on another host, so that we don't need to intervene manually?
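
To make clear what we are doing by hand at the moment: before enabling maintenance we request a relocation (stop, move, start) for every affected VM. Below is a minimal sketch of that step. It assumes `ha-manager crm-command relocate <sid> <target>` is available on the installed version and that the VMs are HA-managed; the function name, VM ID, and node name are made up for illustration, and the script only prints the commands instead of running them.

```python
from typing import Dict, List


def build_relocate_commands(vm_targets: Dict[int, str]) -> List[List[str]]:
    """Build one relocate command (as an argv list) per VM.

    vm_targets maps a VM ID to the node it should be relocated to.
    The target node is chosen by the admin; nothing here checks
    whether a free GPU exists on that node.
    """
    commands = []
    for vmid, target in sorted(vm_targets.items()):
        # Assumption: HA service IDs for VMs have the form "vm:<vmid>".
        commands.append(
            ["ha-manager", "crm-command", "relocate", f"vm:{vmid}", target]
        )
    return commands


# Hypothetical example: move VM 106 to server2 before maintenance.
for argv in build_relocate_commands({106: "server2"}):
    print(" ".join(argv))
```

This is exactly the manual intervention we would like to avoid.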

Bug1
After enabling maintenance mode while VMs with GPUs are running, Proxmox keeps trying to migrate the VMs indefinitely until maintenance mode is disabled, even if "Max Restart/Relocate" is set to 1 in the HA config.


Question2 / Bug
Why does Proxmox migrate VMs with GPU passthrough to servers without enough resources while resources are available on other servers?
Can we change this behaviour?

Example:
Server1 runs VM1, VM2, VM3.
Server2 runs VM4, VM5
Server3 runs VM6
We shut down VM6 and enable maintenance on Server3. Proxmox migrates VM6 to Server1 and tries to start the VM.
It fails because there are no GPUs left on Server1.
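
To illustrate what we would expect from the scheduler in this example, here is a small sketch of a GPU-aware target selection. It is not how the HA stack works today; the function name and the data layout are our own assumptions, and it only counts one GPU per VM as in our setup.

```python
from typing import Dict, List, Optional


def pick_target(
    gpus_per_node: Dict[str, int],
    placement: Dict[str, List[str]],
    source: str,
) -> Optional[str]:
    """Return the node (other than source) with the most free GPUs.

    Each VM in placement is assumed to occupy exactly one GPU.
    Returns None if no other node has a free GPU.
    """
    best = None
    best_free = 0
    for node in sorted(gpus_per_node):
        if node == source:
            continue
        free = gpus_per_node[node] - len(placement.get(node, []))
        if free > best_free:
            best, best_free = node, free
    return best


gpus = {"Server1": 3, "Server2": 3, "Server3": 3}
placement = {
    "Server1": ["VM1", "VM2", "VM3"],
    "Server2": ["VM4", "VM5"],
    "Server3": ["VM6"],
}
print(pick_target(gpus, placement, "Server3"))  # prints Server2
```

With the layout from the example, Server1 has no free GPU and Server2 has one, so VM6 should end up on Server2.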

Of course, with the relocate setting we can mitigate this behaviour. But there seems to be a lot of room for improvement in the scheduler.


Are we missing something here?
Thanks for your help!
 
the general answer to these problems is that the HA stack currently does not have a detailed view of which (local) resources the vm needs to run (this includes passed-through devices, but also available storages, etc.)
e.g. a slightly related bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=5156

i also think it would be good if either the ha stack were aware of such things, or if it could be configured e.g. per ha group or service. that way the ha stack does not have to look into the intricate details of the various hardware/resource requirements, but the admin can say: this service can't be live migrated, or similar

in any case, i think 'Bug1' is really a bug and it shouldn't try to migrate indefinitely (so i'd suggest you open a bug report for it, please)
 
