Backup job with several VMs fails when using GPU passthrough

I have a backup job set up in my data center that backs up 4 of my VMs to a Proxmox Backup Server. 3 of those 4 VMs are set up to pass through the same GPU (hence, they can't run in parallel). This turns out to be a problem when backing up: the backup of the first VM with that GPU succeeds, but the backups of the other VMs fail, with the logs saying that the GPU is still in use by the first backed-up VM.

This seems like a bug to me. Why would the backup need the GPU or any other PCI or USB device? I am running the latest Proxmox 8.2.2 btw.

Here's an excerpt from the backup log showing the transition from the first, successful VM to the second, failing VM (0000:03:00 is the shared GPU). Also note that the successful VM had to be SIGKILLed, which might contribute to the issue. I do not know why SIGTERM has no effect; when booted normally, I can shut this Debian VM down without issues.
INFO: backup is sparse: 496.21 GiB (96%) total zero data
INFO: backup was done incrementally, reused 512.00 GiB (100%)
INFO: transferred 512.00 GiB in 190 seconds (2.7 GiB/s)
INFO: stopping kvm after backup task
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
INFO: adding notes to backup
INFO: prune older backups with retention: keep-weekly=3
INFO: running 'proxmox-backup-client prune' for 'vm/200'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: Finished Backup of VM 200 (00:03:31)
INFO: Backup finished at 2024-05-29 23:04:06
INFO: Starting Backup of VM 300 (qemu)
INFO: Backup started at 2024-05-29 23:04:06
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: xxx
INFO: include disk 'scsi0' 'consumer-pool:vm-300-disk-0' 250G
INFO: include disk 'efidisk0' 'consumer-pool:vm-300-disk-1' 1M
INFO: include disk 'tpmstate0' 'consumer-pool:vm-300-disk-2' 4M
INFO: creating Proxmox Backup Server archive 'vm/300/2024-05-29T21:04:06Z'
INFO: starting kvm to execute backup task
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:03:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:03:00.0: failed to open /dev/vfio/73: Device or resource busy
stopping swtpm instance (pid 138346) due to QEMU startup error
ERROR: Backup of VM 300 failed - start failed: QEMU exited with code 1
INFO: Failed at 2024-05-29 23:04:08

Addition: if I run one of the three GPU VMs while the backup job is running, only the VM that does not have the GPU succeeds in creating a backup, which seems like a logical consequence.
 
QEMU does indeed start the VM to create a backup. The guest OS itself isn't booted, but all the resources are activated, including PCI and USB passthrough devices. For example, you can't pass the same PCI device through to two VMs, have one fully running, and create a backup of the other.

AFAIK, if backups are run sequentially, they should work ok even if they all have the same PCI device configured. Maybe it needs some delay between the backups?
 
Adding a delay does not have any effect. The error originates from the shared GPU. I'll include the full log of the backup job below.
It backs up in this order:
- a Debian VM getting passed two Nvidia GPUs ("DebianML")
- a Debian VM getting passed the shared AMD Radeon Pro WX4100 GPU as well as all USB controllers and the Wifi/BT controller ("DebianMain")
- a Windows 11 VM getting passed the same as DebianMain ("Win11")
- a Hackintosh Sonoma VM getting passed the same as DebianMain ("MacSonoma")

As I wrote before, each VM runs flawlessly on its own. However, in the full log, for DebianMain and MacSonoma I get entries like:
kvm: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.
where 0000:03:00.1 is the audio controller of the WX4100. I have vendor-reset set up, and it works when running the VMs by themselves; without it, the WX4100 would suffer from the AMD reset bug. I assume the reset bug has something to do with those entries.
Anyway, the backups for the *nix-based systems DebianMain and MacSonoma work fine. I get the mentioned entries and the VMs have to be SIGKILLed, but the backups are fine.

What does not work, for some reason, is the Windows VM in between. For this VM I do not get the "Cannot reset device" message, just a plain:
kvm: -device vfio-pci,host=0000:03:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:03:00.0: failed to open /dev/vfio/73: Device or resource busy
stopping swtpm instance (pid 65630) due to QEMU startup error
0000:03:00.0 is the WX4100.
The backup process then continues with MacSonoma, which works.

Here's the full log:
INFO: HOOK: job-init
INFO: starting new backup job: vzdump 100 300 200 400 --notes-template '{{guestname}} {{vmid}}' --mode snapshot --node proxmox --storage PBS --all 0 --prune-backups 'keep-weekly=3' --fleecing 0 --mailnotification failure
INFO: HOOK: job-start
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2024-05-31 11:06:22
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: DebianML
INFO: include disk 'scsi0' 'consumer-pool:vm-100-disk-0' 32G
INFO: HOOK: backup-start stop 100
INFO: creating Proxmox Backup Server archive 'vm/100/2024-05-31T09:06:22Z'
INFO: starting kvm to execute backup task
INFO: started backup task 'f2569e85-eef7-44dd-9a3b-3719e85a115d'
INFO: scsi0: dirty-bitmap status: created new
INFO: 21% (6.9 GiB of 32.0 GiB) in 3s, read: 2.3 GiB/s, write: 0 B/s
INFO: 38% (12.5 GiB of 32.0 GiB) in 6s, read: 1.9 GiB/s, write: 0 B/s
INFO: 59% (18.9 GiB of 32.0 GiB) in 9s, read: 2.2 GiB/s, write: 0 B/s
INFO: 74% (23.8 GiB of 32.0 GiB) in 12s, read: 1.6 GiB/s, write: 0 B/s
INFO: 91% (29.4 GiB of 32.0 GiB) in 15s, read: 1.9 GiB/s, write: 0 B/s
INFO: 100% (32.0 GiB of 32.0 GiB) in 16s, read: 2.6 GiB/s, write: 0 B/s
INFO: backup is sparse: 12.89 GiB (40%) total zero data
INFO: backup was done incrementally, reused 32.00 GiB (100%)
INFO: transferred 32.00 GiB in 16 seconds (2.0 GiB/s)
INFO: stopping kvm after backup task
INFO: adding notes to backup
INFO: prune older backups with retention: keep-weekly=3
INFO: running 'proxmox-backup-client prune' for 'vm/100'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: HOOK: backup-end stop 100
INFO: Finished Backup of VM 100 (00:00:32)
INFO: Backup finished at 2024-05-31 11:06:54
INFO: HOOK: log-end stop 100
INFO: Starting Backup of VM 200 (qemu)
INFO: Backup started at 2024-05-31 11:06:54
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: DebianMain
INFO: include disk 'scsi0' 'consumer-pool:vm-200-disk-0' 512G
INFO: include disk 'efidisk0' 'consumer-pool:vm-200-disk-1' 1M
INFO: HOOK: backup-start stop 200
INFO: creating Proxmox Backup Server archive 'vm/200/2024-05-31T09:06:54Z'
INFO: starting kvm to execute backup task
kvm: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.
kvm: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.

INFO: started backup task 'e06caefd-c50d-4ebb-b234-c08efd370010'
INFO: efidisk0: dirty-bitmap status: created new
INFO: scsi0: dirty-bitmap status: created new
INFO: 2% (12.8 GiB of 512.0 GiB) in 4s, read: 3.2 GiB/s, write: 0 B/s
INFO: 4% (24.8 GiB of 512.0 GiB) in 7s, read: 4.0 GiB/s, write: 0 B/s
....
INFO: 99% (509.2 GiB of 512.0 GiB) in 3m 7s, read: 2.6 GiB/s, write: 0 B/s
INFO: 100% (512.0 GiB of 512.0 GiB) in 3m 9s, read: 1.4 GiB/s, write: 0 B/s
INFO: backup is sparse: 496.21 GiB (96%) total zero data
INFO: backup was done incrementally, reused 512.00 GiB (100%)
INFO: transferred 512.00 GiB in 189 seconds (2.7 GiB/s)
INFO: stopping kvm after backup task
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
INFO: adding notes to backup
INFO: prune older backups with retention: keep-weekly=3
INFO: running 'proxmox-backup-client prune' for 'vm/200'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: HOOK: backup-end stop 200
INFO: Finished Backup of VM 200 (00:03:40)
INFO: Backup finished at 2024-05-31 11:10:34
INFO: HOOK: log-end stop 200
INFO: Starting Backup of VM 300 (qemu)
INFO: Backup started at 2024-05-31 11:10:34
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: Win11
INFO: include disk 'scsi0' 'consumer-pool:vm-300-disk-0' 250G
INFO: include disk 'efidisk0' 'consumer-pool:vm-300-disk-1' 1M
INFO: include disk 'tpmstate0' 'consumer-pool:vm-300-disk-2' 4M
INFO: HOOK: backup-start stop 300
INFO: creating Proxmox Backup Server archive 'vm/300/2024-05-31T09:10:34Z'
INFO: starting kvm to execute backup task
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,host=0000:03:00.0,id=hostpci0.0,bus=ich9-pcie-port-1,addr=0x0.0,multifunction=on: vfio 0000:03:00.0: failed to open /dev/vfio/73: Device or resource busy
stopping swtpm instance (pid 65630) due to QEMU startup error
ERROR: Backup of VM 300 failed - start failed: QEMU exited with code 1
INFO: Failed at 2024-05-31 11:10:45

INFO: HOOK: backup-abort stop 300
INFO: HOOK: log-end stop 300
INFO: Starting Backup of VM 400 (qemu)
INFO: Backup started at 2024-05-31 11:10:45
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 7
INFO: VM Name: MacSonoma
INFO: include disk 'virtio0' 'consumer-pool:vm-400-disk-0' 256G
INFO: include disk 'efidisk0' 'consumer-pool:vm-400-disk-1' 1M
INFO: HOOK: backup-start stop 400
INFO: creating Proxmox Backup Server archive 'vm/400/2024-05-31T09:10:45Z'
INFO: starting kvm to execute backup task
kvm: warning: host doesn't support requested feature: CPUID.01H:ECX.pcid [bit 17]
....
kvm: warning: host doesn't support requested feature: CPUID.07H:EBX.invpcid [bit 10]
kvm: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.
kvm: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.

INFO: started backup task 'e06092eb-7dcf-4e33-a308-d4f08fa3271d'
INFO: efidisk0: dirty-bitmap status: created new
INFO: virtio0: dirty-bitmap status: created new
INFO: 1% (4.7 GiB of 256.0 GiB) in 3s, read: 1.6 GiB/s, write: 0 B/s
INFO: 3% (9.4 GiB of 256.0 GiB) in 6s, read: 1.6 GiB/s, write: 0 B/s
....
INFO: 98% (253.2 GiB of 256.0 GiB) in 1m 41s, read: 2.9 GiB/s, write: 0 B/s
INFO: 100% (256.0 GiB of 256.0 GiB) in 1m 43s, read: 1.4 GiB/s, write: 0 B/s
INFO: backup is sparse: 216.32 GiB (84%) total zero data
INFO: backup was done incrementally, reused 256.00 GiB (100%)
INFO: transferred 256.00 GiB in 103 seconds (2.5 GiB/s)
INFO: stopping kvm after backup task
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
INFO: adding notes to backup
INFO: prune older backups with retention: keep-weekly=3
INFO: running 'proxmox-backup-client prune' for 'vm/400'
INFO: pruned 1 backup(s) not covered by keep-retention policy
INFO: HOOK: backup-end stop 400
INFO: Finished Backup of VM 400 (00:02:14)
INFO: Backup finished at 2024-05-31 11:12:59
INFO: HOOK: log-end stop 400
INFO: HOOK: job-end
INFO: Backup job finished with errors
INFO: notified via target `mail-to-root`
TASK ERROR: job errors

And here are the VMs' configs:
DebianML:
agent: 1
boot: order=scsi0;net0
cores: 16
cpu: x86-64-v2-AES
hostpci1: mapping=GeForce3060_0
hostpci2: mapping=GeForce3060_1
memory: 16384
meta: creation-qemu=8.1.5,ctime=1716465081
name: DebianML
net0: virtio=BC:24:11:44:C3:F9,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: consumer-pool:vm-100-disk-0,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=c78e4312-ffbd-4dd0-a902-e88fa85a564c
sockets: 1
vga: virtio
vmgenid: 4162bf9e-0912-4eb5-b23f-24fadb110f1
DebianMain:
agent: 1
bios: ovmf
boot: order=scsi0;net0
cores: 8
cpu: host
efidisk0: consumer-pool:vm-200-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: mapping=WX4100,pcie=1,x-vga=1
hostpci1: mapping=USB_0
hostpci2: mapping=USB_1
hostpci3: mapping=USB_2_C
hostpci4: mapping=USB_3
hostpci5: mapping=USB_4
hostpci6: mapping=WifiBT
machine: q35
memory: 16384
meta: creation-qemu=8.1.5,ctime=1716588052
name: DebianMain
net0: virtio=BC:24:11:AF:FF:42,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: consumer-pool:vm-200-disk-0,cache=writeback,discard=on,iothread=1,size=512G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=bbb50b0d-9a58-43c0-8777-d13f841f01bd
sockets: 1
vga: none
vmgenid: 4e5b2889-b829-4d67-acd5-339bc01adce6
Win11:
agent: 1
bios: ovmf
boot: order=scsi0;net0
cores: 8
cpu: host
efidisk0: consumer-pool:vm-300-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: mapping=WX4100,pcie=1,x-vga=1
hostpci1: mapping=USB_0
hostpci2: mapping=USB_1
hostpci3: mapping=USB_2_C
hostpci4: mapping=USB_3
hostpci5: mapping=USB_4
hostpci6: mapping=WifiBT
machine: pc-q35-6.2
memory: 16384
meta: creation-qemu=8.1.5,ctime=1716494978
name: Win11
net0: virtio=BC:24:11:8D:69:C6,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsi0: consumer-pool:vm-300-disk-0,cache=writeback,discard=on,size=250G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=816d0745-ea6c-41dd-984b-19b2367e20be
sockets: 1
tpmstate0: consumer-pool:vm-300-disk-2,size=4M,version=v2.0
vga: none
vmgenid: 274a0531-df34-4851-bc6b-e4107c6b9c56
MacSonoma:
args: ...
bios: ovmf
boot: order=virtio0;net0
cores: 16
cpu: x86-64-v2-AES
efidisk0: consumer-pool:vm-400-disk-1,efitype=4m,size=1M
hostpci0: mapping=WX4100,pcie=1,x-vga=1
hostpci1: mapping=USB_0
hostpci2: mapping=USB_1
hostpci3: mapping=USB_2_C
hostpci4: mapping=USB_3
hostpci5: mapping=USB_4
hostpci6: mapping=WifiBT
machine: q35
memory: 16384
meta: creation-qemu=8.1.5,ctime=1716895284
name: MacSonoma
net0: vmxnet3=BC:24:11:18:E2:F4,bridge=vmbr0,firewall=1
numa: 0
ostype: other
scsihw: virtio-scsi-pci
smbios1: uuid=4cc4259e-47f2-4701-a521-966ec08a8ea3
sockets: 1
vga: none
virtio0: consumer-pool:vm-400-disk-0,cache=unsafe,iothread=1,size=256G
vmgenid: eb52aabf-183a-4125-b912-d598d7fadf79

What's the issue? I would like to try reordering the execution, but that is currently not supported by Proxmox.
 
I tried using the newest q35 machine type for the Win11 VM: no effect. I then tried not passing through the on-board audio controller of the WX4100: no effect either, except that the "no available reset mechanism" messages are gone. Win11 still fails to back up with the same message, and the other VMs still have to be SIGKILLed.
Running a backup for each VM individually works fine, btw (still with SIGKILL, even for Win11).
 
It did not have anything to do with Win11 specifically. When I removed that VM from the list, it was the Mac VM's backup that failed: it's always the VM right after the first one that uses the GPU.
So I simply increased the sleep time in the hook script from 10 seconds to 90, and it works...! Still seems like a bug.
 
Probably QEMU needs some time to fully release the PCI device before it can be used again. I suggest that you open a bug report [1]; I think vzdump should wait until the device is fully released before finishing the backup of a VM, so that the device is ready to be used by any other VM.

Did you try booting a VM with PCI passthrough, using it for a while, then stopping it and immediately trying to start another VM that has the same PCI device configured? That is, emulating what a backup does, but fully starting the VM. I'm curious whether it behaves the same way, i.e. whether you need to wait a while before being able to start the second VM.
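For example, something along these lines (an untested sketch; the VMIDs 200 and 300 are taken from your configs):
Perl:
#!/usr/bin/perl -w

# Untested sketch: emulate what the backup job does, but with fully
# booted VMs. If 'qm start 300' fails with "Device or resource busy",
# the issue is not specific to vzdump.

use strict;

system('qm', 'start', '200') == 0 or die "start 200 failed\n";
sleep(120);   # "use it for a while"
system('qm', 'stop', '200') == 0 or die "stop 200 failed\n";
system('qm', 'start', '300') == 0 or die "start 300 failed\n";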

[1] https://bugzilla.proxmox.com/buglist.cgi?component=Qemu&list_id=42365&product=pve&resolution=---
 
Stopping and immediately starting another VM with GPU passthrough works fine, just tested it.
Also, I passed a different GPU to the two VMs, one that does not suffer from the AMD reset bug, and the backup still failed for the same reason: the GPU was still in use.
Filed a bug report at https://bugzilla.proxmox.com/show_bug.cgi?id=5511.

In case anybody runs into the same issue, here's the simple sleep hook script (30 seconds is enough btw):
Perl:
#!/usr/bin/perl -w

# vzdump hook script: sleep before each VM's backup starts so that
# QEMU has time to fully release the passed-through PCI devices of
# the previously backed-up VM.

use strict;

print "HOOK: " . join(' ', @ARGV) . "\n";

my $phase = shift;

if ($phase eq 'backup-start') {
    print "sleeping\n";
    sleep(30);
}

exit(0);
And I enabled it in /etc/vzdump.conf with script: /path/to/script.pl (note that the script has to be executable).
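If the fixed sleep feels too arbitrary, a variant of the hook could instead poll the vfio group device until the previous QEMU instance has released it. This is only an untested sketch: the group number (73, taken from the error message above) is host-specific, and the retry limit is an arbitrary assumption.
Perl:
#!/usr/bin/perl -w

# Untested sketch: instead of a fixed sleep, wait (up to ~90 s) for
# the vfio group device to become openable again. A vfio group can
# only be held by one process at a time, so a successful open means
# the previous QEMU has released the passed-through devices.

use strict;
use Fcntl;   # exports O_RDWR

my $phase = shift;

if (defined $phase && $phase eq 'backup-start') {
    for my $try (1 .. 30) {
        if (sysopen(my $fh, '/dev/vfio/73', O_RDWR)) {
            close($fh);   # device is free again
            last;
        }
        print "vfio group still busy, retry $try\n";
        sleep(3);
    }
}

exit(0);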
 