Nvidia vGPU mdev and live migration

Hi, I finally got it working! The main problem was a missing build option in the nvidia vfio kernel driver needed to enable live migration.


View attachment 40058

@spirit hi, my friend. I have enabled vgpu migration, but when migrating I get an error like the one in your picture:
src VM error (I see this message is also included in your picture):
qemu-system-x86_64: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/426e9111-3eaf-4518-809e-a0bcdaf504dd,display=off,bus=pci.0,addr=0xd: warning: vfio 426e9111-3eaf-4518-809e-a0bcdaf504dd: Could not enable error recovery for the device

dst qemu error:
2023-09-28T01:17:49.245272Z qemu-system-x86_64: get_pci_config_device: Bad config data: i=0x2 read: b8 device: 36 cmask: ff wmask: 0 w1cmask:0
2023-09-28T01:17:49.245291Z qemu-system-x86_64: Failed to load PCIDevice:config
2023-09-28T01:17:49.245294Z qemu-system-x86_64: Failed to load VFIOPCIDevice : pdev
2023-09-28T01:17:49.245297Z qemu-system-x86_64: 3b1f76fa-a3c0-4c87-8962-041f228e6784: Failed to load device config space
2023-09-28T01:17:49.245299Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:0d.0/vfio'
2023-09-28T01:17:49.247864Z qemu-system-x86_64: load of migration failed: Invalid argument
{"timestamp": {"seconds": 1695863869, "microseconds": 247924}, "event": "MIGRATION", "data": {"status": "failed"}}
2023-09-28 01:17:49.299+0000: shutting down, reason=failed

It seems to load the old PCI address of the vGPU.


I want to know how to configure the mdev device for QEMU at the destination. I configured a new mdev device when migrating to the destination, but it doesn't seem to work.

Code:
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
      <source>
        <address uuid='426e9111-3eaf-4518-809e-a0bcdaf504dd'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0d' function='0x0'/>
    </hostdev>


For this kind of migration involving hardware, when I migrate to the destination I simply replace the src vGPU UUID with the UUID of the destination vGPU. I don't know if there is a problem with doing it this way, and I'm not sure whether QEMU can automatically synchronize the vGPU's graphics data to the destination. How do you do it?
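
This is roughly what I do on the destination host before starting the incoming QEMU there: pre-create the mdev via sysfs (the UUID is the one from the XML above; the parent GPU address and vGPU type ID are just placeholders for my setup):
Code:
# on the destination host, pre-create the mdev the incoming QEMU will attach to
# (parent GPU address and vGPU type are placeholders for my setup)
UUID=426e9111-3eaf-4518-809e-a0bcdaf504dd
PARENT=0000:83:00.0
TYPE=nvidia-63
echo "$UUID" > /sys/class/mdev_bus/$PARENT/mdev_supported_types/$TYPE/create
ls /sys/bus/mdev/devices/$UUID   # check it exists before starting the migration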
 
oh great :)

I know somebody with a cluster with Tesla cards for testing. I'll try to do tests again next month.

@badji ping. Time to test gpu migration again ^_^
@spirit
I'm just starting to look at this now. Proxmox: 8.1.3, Kernel: 6.5.11-6-pve and Driver 16.1 (535.104.06).
Have you had any luck with it?

I've got two Dell R830's with a Tesla P4 in each one for testing.
 
@spirit
I'm just starting to look at this now. Proxmox: 8.1.3, Kernel: 6.5.11-6-pve and Driver 16.1 (535.104.06).
Have you had any luck with it?

I've got two Dell R830's with a Tesla P4 in each one for testing.
Not yet. I don't have the hardware for testing; I need to ask a friend for access to his cluster.
 
I just want to contribute to say that I have tested with the GRID 16.2 vGPU driver (535.129.03) and I only had to comment out the lines in /usr/share/perl5/PVE/QemuMigrate.pm to get migration working from the CLI on PVE 8.1:

Code:
#   if (scalar($blocking_resources->@*)) {
#    if ($self->{running} || !$self->{opts}->{force}) {
#        die "can't migrate VM which uses local devices: " . join(", ", $blocking_resources->@*) . "\n";
#    } else {
#        $self->log('info', "migrating VM which uses local devices");
#    }
#   }
 
#    if (scalar($mapped_res->@*)) {
#    my $missing_mappings = $missing_mappings_by_node->{$self->{node}};
#    if ($running) {
#        die "can't migrate running VM which uses mapped devices: " . join(", ", $mapped_res->@*) . "\n";
#    } elsif (scalar($missing_mappings->@*)) {
#        die "can't migrate to '$self->{node}': missing mapped devices " . join(", ", $missing_mappings->@*) . "\n";
#    } else {
#        $self->log('info', "migrating VM which uses mapped local devices");
#    }
#    }
This was between two servers that both have 3 x Nvidia A40 GPUs in them, so none of the vGPU hacking needed for consumer cards, and we have plenty of vGPU licenses to test with.

I also created a mapping for all the GPU mdev devices for each server and was surprised to see that I could get migration working between the servers using that too (hence the second commented block above). I have no idea if it was remapping the IDs correctly on migrate - I need to test with more VMs.
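
I guess one way to check would be (assuming the usual PVE pid file location) to pull the sysfsdev argument out of the running kvm process on the target node after a migration:
Code:
# show the mdev sysfs path the running kvm process was started with
VMID=103
ps -o args= -p "$(cat /var/run/qemu-server/$VMID.pid)" | tr ',' '\n' | grep sysfsdev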

The only bit I haven't found is how to disable the GUI migrate warning - "Can't migrate running VM with mapped hostpci0". I also need to test if it will do the right thing and fill up one pGPU before moving on to assigning the second & third pGPUs to newly powered on or migrated VMs.

Either way, it seems like PVE 8.1 & GRID 16.2 is pretty much ready for vGPU live migration with minimal patching?
 
Either way, it seems like PVE 8.1 & GRID 16.2 is pretty much ready for vGPU live migration with minimal patching?
well this code patches out all safeguards for live migration, which is not what we want (e.g. normal gpus don't support that)

we have to introduce some way of marking devices as migration capable (only the admin really can know if that's supported, there is not really a way to determine that automatically, at least AFAIK)
and then handle such cards specially in the code you commented out

also it might be necessary (for some devices?) to add the 'x-enable-migration' option for the pci device in qemu itself, though if you tested without that, i'm not so sure
 
I agree, a flag to say that a PCI device is migrate-able would be good. Perhaps it belongs alongside the new mdev device mapping feature in PVE 8.1?

Yea, I'm aware of the patch to /usr/share/perl5/PVE/QemuServer/PCI.pm mentioned earlier in this thread to add the x-enable-migration but I found that I could migrate a vGPU VM from the CLI (qm migrate) just fine without it?

Does it matter that I'm using a "mapped" device perhaps? I have created an "A40" mapped device with every mdev device (32) for every pGPU (3) on every host (3) - so 3x3x32=288 entries for the "A40" mapped device in total.
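
For anyone wanting to do the same, this is roughly how I enumerated the mdev types and their free instances per (virtual) function when building the mapping (paths as they look on our SR-IOV hosts):
Code:
# list every mdev type and how many instances are still free, per function
for t in /sys/bus/pci/devices/*/mdev_supported_types/*; do
    echo "$t  $(cat "$t/name")  available=$(cat "$t/available_instances")"
done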

So after a bunch of tests and migrating things around my 3 node cluster, one of the servers got into an odd state where it was not running any VMs but had stopped reporting any free mdev devices. I'm assuming that's an nvidia driver issue, as that is what tracks mdev usage?
Code:
root@mtlvdi178:~# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID       
       103 mtlws1813            stopped    128000           500.00 0         
root@mtlvdi178:~# nvidia-smi vgpu
Wed Jan 10 07:35:44 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03                |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA A40                 | 00000000:21:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   1  NVIDIA A40                 | 00000000:81:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
|   2  NVIDIA A40                 | 00000000:E2:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
root@mtlvdi178:~# cat /sys/bus/pci/devices/*/mdev*/nvidia-566/available_instances | uniq -c
     96 0
It was working fine for a while but I have been migrating VMs with 48Q (nvidia-566) vGPUs around and noticed it broke at some point.

Another identical node with nothing running on it:
Code:
root@mtlvdi180:~# cat /sys/bus/pci/devices/*/mdev*/nvidia-566/available_instances | uniq -c
     96 1
I rebooted the server and ran "/usr/lib/nvidia/sriov-manage -e ALL" and everything started working again. Maybe it was a once off... I'll continue testing.
 
So I can reproduce the "mdev not freeing" issue with every migration now. I just hadn't noticed it before because I was using smaller vGPUs, so it would take more migrations to use them all up. Now if I set a 48Q vGPU and migrate it off a host, that host+pGPU cannot start another VM until we remove and re-add the VFs:
Code:
root@mtlvdi178:~#  /usr/lib/nvidia/sriov-manage -d ALL
Disabling VFs on 0000:21:00.0
Disabling VFs on 0000:81:00.0
Disabling VFs on 0000:e2:00.0

root@mtlvdi178:~# /usr/lib/nvidia/sriov-manage -e ALL
Enabling VFs on 0000:21:00.0
Enabling VFs on 0000:81:00.0
Enabling VFs on 0000:e2:00.0

root@mtlvdi178:~# cat /sys/bus/pci/devices/*/mdev*/nvidia-566/available_instances | uniq -c
     96 1
I also tried adding "enable-migration=on" (it's no longer experimental) to /usr/share/perl5/PVE/QemuServer/PCI.pm:
Code:
        $devicestr .= ",enable-migration=on,sysfsdev=$sysfspath";
But it didn't really change anything. Migration still works but the mdev is not freed up and made available again on the originating host.

Is it not the case that on newer qemu the "enable-migration=on" option is now the default and hence that is why my testing was working without this patch?

I should also clarify: the mdev is correctly marked as available again if the VM is shut down - it's only when migrated that "available_instances" is not restored once the VM has moved off.
 
Okay, I see what's missing now - after the migration has completed, we need to remove the mdev device on the origin host. For example after all VMs have been migrated off this host:
Code:
root@mtlvdi178:~# cat /sys/bus/pci/devices/*/mdev*/nvidia-566/available_instances | uniq -c
     32 0
     64 1

root@mtlvdi178:~# echo "1" > /sys/bus/mdev/devices/00000000-0000-0000-0000-000000000101/remove

root@mtlvdi178:~# cat /sys/bus/pci/devices/*/mdev*/nvidia-566/available_instances | uniq -c
     96 1
I guess that is being done correctly for the case where we power off the VM, but that code does not yet exist for live migration. I would think it would be a fairly easy fix?
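
In the meantime, this is what I run by hand on the source host after migrating a VM away - just a sketch, with the UUID being whatever mdev that VM was using:
Code:
# free the mdev left behind on the source host after a successful migration
UUID=00000000-0000-0000-0000-000000000101
if [ -e "/sys/bus/mdev/devices/$UUID" ]; then
    echo 1 > "/sys/bus/mdev/devices/$UUID/remove"
fi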
 
ok, so about enable-migration: it is indeed 'auto' by default (however they determine that; i'll look into it, maybe we can detect it the same way?)
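
(if you want to check locally whether your qemu build exposes the property at all, something like this should list it)
Code:
# list the vfio-pci device properties of the installed QEMU build
kvm -device vfio-pci,help | grep -i migration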

regarding the release of the mdev:

we initially "freed" the mdev by ourselves during stop/cleanup/etc. but at some point the nvidia driver complained that it couldn't free it itself and got into a very weird state so that the mdev was not visible again, but their internal accounting had none free ...
so we worked around it by waiting some amount of time after cleanup and then try it (if still there)

so maybe the cleanup call is not made when migrating away, but that should not be a problem... i'm not sure when i have time for that though, but hopefully soon
 
Okay great, glad I could help test it anyway.

If you would like me to test anything just let me know. We have a large install base of VDI based on "VMWare for Desktop" so we have pretty good experience with how vGPUs work with that hypervisor stack and its various features.

We would love the opportunity to move over to something like PVE instead but that will depend on replicating core functionality that we need (but we can drop lots of "nice-to-have" features too).

I am also going to open a conversation with Nvidia to see if they might consider adding Debian + PVE to their KVM "supported" matrix. Things like Ubuntu, RHEV, Openstack etc are already named:

https://docs.nvidia.com/grid/16.0/g...x-kvm/index.html#hypervisor-software-versions

It would just help with any support tickets we ever need to open with them.
 
If you would like me to test anything just let me know. We have a large install base of VDI based on "VMWare for Desktop" so we have pretty good experience with how vGPUs work with that hypervisor stack and its various features.

We would love the opportunity to move over to something like PVE instead but that will depend on replicating core functionality that we need (but we can drop lots of "nice-to-have" features too).
you could open feature requests/bug reports for the issues you encounter (including the live migration, i think there isn't one yet for that) so we can update whenever we send patches to the devel list
(also it lets us keep better track of these things)

I am also going to open a conversation with Nvidia to see if they might consider adding Debian + PVE to their KVM "supported" matrix. Things like Ubuntu, RHEV, Openstack etc are already named:

https://docs.nvidia.com/grid/16.0/g...x-kvm/index.html#hypervisor-software-versions

It would just help with any support tickets we ever need to open with them.
yes please do!

we tried to contact nvidia several times in the past about that, but were either ghosted after asking a few (in my opinion rather simple) questions or they outright ignored our messages...
if they want to contact us on an official channel, tell them to use the office@proxmox.com email address for that
 
great, i'll see when i get to implementing the migration (not much time at the moment)


good, it would be nice if you update us if there is news about that
I see that you've been busy working on this. Much appreciated.

I am following along and applied all patches mentioned at: https://lists.proxmox.com/pipermail/pve-devel/2024-March/062226.html

I have three hosts in my cluster, but only two of them have Nvidia Tesla P4s. Will that be a problem with the way the code checks for mappings?
"if (scalar($mapped_res->@*)) {
my $missing_mappings = $missing_mappings_by_node->{$self->{node}};"

I had to comment out that whole section as mentioned earlier, in order to get the migration to start from the CLI.
Otherwise it just showed the normal message "can't migrate running VM which uses mapped devices: hostpci0"

Unfortunately even after commenting it out, it failed.

Migrations work perfectly fine without a vGPU mapped. There is no usage/activity on the test VM, so I'm not sure why it complains of dirty memory pages. The vGPU isn't currently licensed, but I wouldn't imagine that would cause dirty memory. I spent the better part of today trying to get it going without much luck.

Here is the log.
root@DEV2:/usr/share/perl5/PVE# qm migrate 100 DEV3 --online
2024-03-28 20:48:45 use dedicated network address for sending migration traffic (10.11.1.53)
2024-03-28 20:48:46 starting migration of VM 100 to node 'DEV3' (10.11.1.53)
2024-03-28 20:48:46 starting VM 100 on remote node 'DEV3'
2024-03-28 20:48:50 [DEV3] kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: warning: vfio 00000000-0000-0000-0000-000000000100: Could not enable error recovery for the device
2024-03-28 20:48:50 start remote tunnel
2024-03-28 20:48:52 ssh tunnel ver 1
2024-03-28 20:48:52 starting online/live migration on tcp:10.11.1.53:60000
2024-03-28 20:48:52 set migration capabilities
2024-03-28 20:48:52 migration downtime limit: 100 ms
2024-03-28 20:48:52 migration cachesize: 1.0 GiB
2024-03-28 20:48:52 set migration parameters
2024-03-28 20:48:52 start migrate command to tcp:10.11.1.53:60000
2024-03-28 20:48:53 migration active, transferred 92.9 MiB of 8.0 GiB VM-state, 3.8 GiB/s
2024-03-28 20:48:54 migration active, transferred 953.2 MiB of 8.0 GiB VM-state, 1.1 GiB/s
2024-03-28 20:48:55 migration active, transferred 2.1 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2024-03-28 20:48:56 migration active, transferred 3.1 GiB of 8.0 GiB VM-state, 1.2 GiB/s
2024-03-28 20:48:58 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 124.6 MiB/s
2024-03-28 20:48:58 xbzrle: send updates to 13681 pages in 231.1 KiB encoded memory
2024-03-28 20:48:59 auto-increased downtime to continue migration: 200 ms
2024-03-28 20:48:59 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 170.3 MiB/s
2024-03-28 20:48:59 xbzrle: send updates to 49462 pages in 856.7 KiB encoded memory, cache-miss 8.65%
2024-03-28 20:49:00 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 133.2 MiB/s, VM dirties lots of memory: 137.7 MiB/s
2024-03-28 20:49:00 xbzrle: send updates to 82066 pages in 1.4 MiB encoded memory, cache-miss 5.27%
2024-03-28 20:49:00 auto-increased downtime to continue migration: 400 ms
2024-03-28 20:49:01 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 109.0 MiB/s, VM dirties lots of memory: 112.1 MiB/s
2024-03-28 20:49:01 xbzrle: send updates to 109881 pages in 1.9 MiB encoded memory, cache-miss 15.94%, overflow 1
2024-03-28 20:49:01 auto-increased downtime to continue migration: 800 ms
2024-03-28 20:49:02 migration active, transferred 4.3 GiB of 8.0 GiB VM-state, 113.9 MiB/s, VM dirties lots of memory: 117.4 MiB/s
2024-03-28 20:49:02 xbzrle: send updates to 139437 pages in 2.4 MiB encoded memory, cache-miss 1.53%, overflow 1
2024-03-28 20:49:02 auto-increased downtime to continue migration: 1600 ms
2024-03-28 20:49:03 migration active, transferred 4.4 GiB of 8.0 GiB VM-state, 99.0 MiB/s, VM dirties lots of memory: 103.2 MiB/s
2024-03-28 20:49:03 xbzrle: send updates to 167500 pages in 3.0 MiB encoded memory, cache-miss 2.19%, overflow 1
2024-03-28 20:49:03 auto-increased downtime to continue migration: 3200 ms
2024-03-28 20:49:04 migration active, transferred 4.4 GiB of 8.0 GiB VM-state, 84.9 MiB/s, VM dirties lots of memory: 106.5 MiB/s
2024-03-28 20:49:04 xbzrle: send updates to 192490 pages in 3.4 MiB encoded memory, cache-miss 4.84%, overflow 1
2024-03-28 20:49:04 auto-increased downtime to continue migration: 6400 ms
2024-03-28 20:49:05 migration active, transferred 4.4 GiB of 8.0 GiB VM-state, 102.2 MiB/s
2024-03-28 20:49:05 xbzrle: send updates to 217058 pages in 3.8 MiB encoded memory, cache-miss 5.34%, overflow 2
2024-03-28 20:49:06 auto-increased downtime to continue migration: 12800 ms
2024-03-28 20:49:06 migration active, transferred 4.4 GiB of 8.0 GiB VM-state, 81.5 MiB/s, VM dirties lots of memory: 92.4 MiB/s
2024-03-28 20:49:06 xbzrle: send updates to 237674 pages in 4.2 MiB encoded memory, cache-miss 5.33%, overflow 2
2024-03-28 20:49:07 auto-increased downtime to continue migration: 25600 ms
2024-03-28 20:49:07 migration active, transferred 4.4 GiB of 8.0 GiB VM-state, 75.7 MiB/s
2024-03-28 20:49:07 xbzrle: send updates to 257761 pages in 4.5 MiB encoded memory, cache-miss 1.26%, overflow 2
2024-03-28 20:49:08 auto-increased downtime to continue migration: 51200 ms
2024-03-28 20:49:08 migration status error: failed
2024-03-28 20:49:08 ERROR: online migrate failure - aborting
2024-03-28 20:49:08 aborting phase 2 - cleanup resources
2024-03-28 20:49:08 migrate_cancel
2024-03-28 20:49:22 ERROR: migration finished with problems (duration 00:00:38)
migration problems
 
We have also been testing the patches and everything is working great for us, but all our servers are the same (3 x A40) so the mapping is on every server. We can try removing the mapping for one of the servers in the cluster and see what happens...

It is also worth noting that we are using SR-IOV enabled cards while the Pascal cards (P4/P100) are not? And I see that GRID 17 (55X) has dropped support for Pascal cards.

The GRID licensing shouldn't make any difference in this case - it will only restrict performance (3fps) AFAIK.
 
I see that you've been busy working on this. Much appreciated.

I am following along and applied all patches mentioned at: https://lists.proxmox.com/pipermail/pve-devel/2024-March/062226.html

I have three hosts in my cluster, but only two of them have Nvidia Tesla P4s. Will that be a problem with the way the code checks for mappings?
"if (scalar($mapped_res->@*)) {
my $missing_mappings = $missing_mappings_by_node->{$self->{node}};"

I had to comment out that whole section as mentioned earlier, in order to get the migration to start from the CLI.
Otherwise it just showed the normal message "can't migrate running VM which uses mapped devices: hostpci0"

Unfortunately even after commenting it out, it failed.

Migrations work perfectly fine without a vGPU mapped. There is no usage/activity on the test VM, so I'm not sure why it complains of dirty memory pages. The vGPU isn't currently licensed, but I wouldn't imagine that would cause dirty memory. I spent the better part of today trying to get it going without much luck.

Here is the log.
root@DEV2:/usr/share/perl5/PVE# qm migrate 100 DEV3 --online
[...]
2024-03-28 20:49:08 migration status error: failed
2024-03-28 20:49:08 ERROR: online migrate failure - aborting
2024-03-28 20:49:08 aborting phase 2 - cleanup resources
2024-03-28 20:49:08 migrate_cancel
2024-03-28 20:49:22 ERROR: migration finished with problems (duration 00:00:38)
migration problems
there are still some bugs with the current code which might affect you, but could you post your mapping config? and the whole task log when using the mapping?

also the syslogs (of both source/target) from the failed migration above would be interesting
 
PCI.cfg
---
Tesla-P4
map id=10de:1bb3,iommugroup=7,node=DEV3,path=0000:83:00.0,subsystem-id=10de:1>
map id=10de:1bb3,iommugroup=7,node=DEV2,path=0000:83:00.0,subsystem-id=10de:1>
live-migration-capable 1
mdev 1
---
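
The map lines above got cut off by my terminal width; I think the same mapping can also be dumped in full via the API (assuming the PVE 8.x /cluster/mapping/pci endpoint):
Code:
pvesh get /cluster/mapping/pci --output-format json-pretty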

VM start log:
kvm: -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: warning: vfio 00000000-0000-0000-0000-000000000100: Could not enable error recovery for the device
TASK OK

2024-04-05T17:30:50.625268+10:00 DEV2 pvedaemon[91601]: <root@pam> starting task UPID:DEV2:002A33B0:042B6825:660FA8AA:qmstart:100:root@pam:
2024-04-05T17:30:50.626478+10:00 DEV2 pvedaemon[2765744]: start VM 100: UPID:DEV2:002A33B0:042B6825:660FA8AA:qmstart:100:root@pam:
2024-04-05T17:30:50.787718+10:00 DEV2 kernel: [699535.158952] nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000100: Adding to iommu group 99
2024-04-05T17:30:50.916414+10:00 DEV2 systemd[1]: Started 100.scope.
2024-04-05T17:30:52.431713+10:00 DEV2 kernel: [699536.806097] tap100i0: entered promiscuous mode
2024-04-05T17:30:52.531725+10:00 DEV2 kernel: [699536.903503] vmbr0: port 2(fwpr100p0) entered blocking state
2024-04-05T17:30:52.531757+10:00 DEV2 kernel: [699536.903517] vmbr0: port 2(fwpr100p0) entered disabled state
2024-04-05T17:30:52.531760+10:00 DEV2 kernel: [699536.903569] fwpr100p0: entered allmulticast mode
2024-04-05T17:30:52.531763+10:00 DEV2 kernel: [699536.903780] fwpr100p0: entered promiscuous mode
2024-04-05T17:30:52.531766+10:00 DEV2 kernel: [699536.904426] tg3 0000:01:00.0 eno1: entered promiscuous mode
2024-04-05T17:30:52.531771+10:00 DEV2 kernel: [699536.905168] vmbr0: port 2(fwpr100p0) entered blocking state
2024-04-05T17:30:52.531773+10:00 DEV2 kernel: [699536.905175] vmbr0: port 2(fwpr100p0) entered forwarding state
2024-04-05T17:30:52.687708+10:00 DEV2 kernel: [699537.061221] fwbr100i0: port 1(fwln100i0) entered blocking state
2024-04-05T17:30:52.687742+10:00 DEV2 kernel: [699537.061235] fwbr100i0: port 1(fwln100i0) entered disabled state
2024-04-05T17:30:52.687745+10:00 DEV2 kernel: [699537.061296] fwln100i0: entered allmulticast mode
2024-04-05T17:30:52.687748+10:00 DEV2 kernel: [699537.061407] fwln100i0: entered promiscuous mode
2024-04-05T17:30:52.687751+10:00 DEV2 kernel: [699537.061504] fwbr100i0: port 1(fwln100i0) entered blocking state
2024-04-05T17:30:52.687757+10:00 DEV2 kernel: [699537.061510] fwbr100i0: port 1(fwln100i0) entered forwarding state
2024-04-05T17:30:52.707716+10:00 DEV2 kernel: [699537.081433] fwbr100i0: port 2(tap100i0) entered blocking state
2024-04-05T17:30:52.707733+10:00 DEV2 kernel: [699537.081443] fwbr100i0: port 2(tap100i0) entered disabled state
2024-04-05T17:30:52.707736+10:00 DEV2 kernel: [699537.081490] tap100i0: entered allmulticast mode
2024-04-05T17:30:52.707739+10:00 DEV2 kernel: [699537.081650] fwbr100i0: port 2(tap100i0) entered blocking state
2024-04-05T17:30:52.707741+10:00 DEV2 kernel: [699537.081656] fwbr100i0: port 2(tap100i0) entered forwarding state
2024-04-05T17:30:52.816142+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
2024-04-05T17:30:52.816328+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0>
2024-04-05T17:30:52.817020+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=63
2024-04-05T17:30:52.822177+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_env_log: Successfully updated env symbols!
2024-04-05T17:30:52.826051+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): gpu-pci-id : 0x8300
2024-04-05T17:30:52.826172+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): vgpu_type : Quadro
2024-04-05T17:30:52.826263+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): Framebuffer: 0x38000000
2024-04-05T17:30:52.826328+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1bb3:0x1204
2024-04-05T17:30:52.826397+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
2024-04-05T17:30:52.826465+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: ######## vGPU Manager Information: ########
2024-04-05T17:30:52.826533+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: Driver Version: 535.104.06
2024-04-05T17:30:52.827705+10:00 DEV2 kernel: [699537.198741] NVRM: Software scheduler timeslice set to 2083uS.
2024-04-05T17:30:52.831753+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
2024-04-05T17:30:52.878398+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
2024-04-05T17:30:52.955153+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): vGPU migration enabled
2024-04-05T17:30:52.992553+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): vGPU manager is running in non-SRIOV mode.
2024-04-05T17:30:52.997675+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: display_init inst: 0 successful
2024-04-05T17:30:53.007707+10:00 DEV2 kernel: [699537.380300] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration enabled with upstream V2 migration proto>
2024-04-05T17:30:53.227678+10:00 DEV2 pvedaemon[91601]: <root@pam> end task UPID:DEV2:002A33B0:042B6825:660FA8AA:qmstart:100:root@pam: OK
2024-04-05T17:31:22.586415+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
2024-04-05T17:31:22.586808+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: Driver Version: 537.70
2024-04-05T17:31:22.586892+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: vGPU version: 0x120001
2024-04-05T17:31:22.588776+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): vGPU license state: Unlicensed (Unrestricted)
2024-04-05T17:37:20.122507+10:00 DEV2 pmxcfs[1702]: [status] notice: received log
2024-04-05T17:37:20.443522+10:00 DEV2 pmxcfs[1702]: [status] notice: received log
2024-04-05T17:37:20.744382+10:00 DEV2 systemd[1]: Started session-229.scope - Session 229 of User root.
2024-04-05T17:38:04.993876+10:00 DEV2 pmxcfs[1702]: [status] notice: received log
2024-04-05T17:40:30.613263+10:00 DEV2 qm[2770118]: <root@pam> starting task UPID:DEV2:002A44DC:042C4AB3:660FAAEE:qmigrate:100:root@pam:
2024-04-05T17:40:35.718619+10:00 DEV2 pmxcfs[1702]: [status] notice: received log
2024-04-05T17:40:38.315843+10:00 DEV2 pmxcfs[1702]: [status] notice: received log
2024-04-05T17:40:55.762604+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_env_log: (0x0): Plugin migration stage change none -> stop_and_copy. QEMU migration state: STOPN>
2024-04-05T17:40:55.810364+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): Start saving vGPU state ...
2024-04-05T17:40:55.966456+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_env_log: (0x0): Plugin migration stage change stop_and_copy -> none. QEMU migration state: NONE
2024-04-05T17:40:55.966589+10:00 DEV2 nvidia-vgpu-mgr[2765933]: notice: vmiop_log: (0x0): Migration Ended
2024-04-05T17:40:55.971691+10:00 DEV2 kernel: [700140.341867] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Ignoring transition from STOP to RUNNING state
2024-04-05T17:40:58.148624+10:00 DEV2 pmxcfs[1702]: [status] notice: received log
2024-04-05T17:41:08.184889+10:00 DEV2 pmxcfs[1702]: [status] notice: received log
2024-04-05T17:41:09.227381+10:00 DEV2 qm[2770140]: migration problems
2024-04-05T17:41:09.252188+10:00 DEV2 qm[2770118]: <root@pam> end task UPID:DEV2:002A44DC:042C4AB3:660FAAEE:qmigrate:100:root@pam: migration problems

I've had a look at the Nvidia documentation and although the P4 doesn't support SR-IOV, it does support migration.

2.5. vGPU Migration Support

vGPU migration, which includes vMotion and suspend-resume, is supported only on a subset of supported GPUs, VMware vSphere Hypervisor (ESXi) releases, and guest operating systems.
Note: vGPU migration is disabled for a VM for which any of the following NVIDIA CUDA Toolkit features is enabled:
  • Unified memory
  • Debuggers
  • Profilers

Supported GPUs

  • Tesla M6
  • Tesla M10
  • Tesla M60
  • Tesla P4
  • Tesla P6
  • Tesla P40
  • Tesla V100 SXM2
  • Tesla V100 SXM2 32GB
  • Tesla V100 PCIe
  • Tesla V100 PCIe 32GB
  • Tesla V100S PCIe 32GB
  • Tesla V100 FHHL
  • Tesla T4
  • Quadro RTX 6000
  • Quadro RTX 6000 passive
  • Quadro RTX 8000
  • Quadro RTX 8000 passive
  • NVIDIA A2
  • NVIDIA A10
  • NVIDIA A16
  • NVIDIA A40
  • NVIDIA RTX A5000
  • NVIDIA RTX A5500
  • NVIDIA RTX A6000
  • Since 16.3: NVIDIA L2
  • NVIDIA L4
  • Since 16.3: NVIDIA L20
  • NVIDIA L40
  • Since 16.1: NVIDIA L40S
  • Since 16.1: NVIDIA RTX 5000 Ada
  • NVIDIA RTX 6000 Ada
 
2024-04-05T17:40:34.214650+10:00 DEV3 systemd[1]: Started session-323.scope - Session 323 of User root.
2024-04-05T17:40:35.717408+10:00 DEV3 qm[2866488]: <root@pam> starting task UPID:DEV3:002BBD39:044EA2E9:660FAAF3:qmstart:100:root@pam:
2024-04-05T17:40:35.718014+10:00 DEV3 qm[2866489]: start VM 100: UPID:DEV3:002BBD39:044EA2E9:660FAAF3:qmstart:100:root@pam:
2024-04-05T17:40:35.898306+10:00 DEV3 kernel: [722616.748068] nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000100: Adding to iommu group 97
2024-04-05T17:40:36.047105+10:00 DEV3 systemd[1]: Started 100.scope.
2024-04-05T17:40:37.610297+10:00 DEV3 kernel: [722618.457164] tap100i0: entered promiscuous mode
2024-04-05T17:40:37.726306+10:00 DEV3 kernel: [722618.574823] vmbr0: port 1(fwpr100p0) entered blocking state
2024-04-05T17:40:37.726340+10:00 DEV3 kernel: [722618.574837] vmbr0: port 1(fwpr100p0) entered disabled state
2024-04-05T17:40:37.726345+10:00 DEV3 kernel: [722618.574898] fwpr100p0: entered allmulticast mode
2024-04-05T17:40:37.726349+10:00 DEV3 kernel: [722618.575030] fwpr100p0: entered promiscuous mode
2024-04-05T17:40:37.726353+10:00 DEV3 kernel: [722618.575161] vmbr0: port 1(fwpr100p0) entered blocking state
2024-04-05T17:40:37.726358+10:00 DEV3 kernel: [722618.575166] vmbr0: port 1(fwpr100p0) entered forwarding state
2024-04-05T17:40:37.750308+10:00 DEV3 kernel: [722618.598775] fwbr100i0: port 1(fwln100i0) entered blocking state
2024-04-05T17:40:37.750344+10:00 DEV3 kernel: [722618.598788] fwbr100i0: port 1(fwln100i0) entered disabled state
2024-04-05T17:40:37.750348+10:00 DEV3 kernel: [722618.598843] fwln100i0: entered allmulticast mode
2024-04-05T17:40:37.750350+10:00 DEV3 kernel: [722618.599036] fwln100i0: entered promiscuous mode
2024-04-05T17:40:37.750353+10:00 DEV3 kernel: [722618.599150] fwbr100i0: port 1(fwln100i0) entered blocking state
2024-04-05T17:40:37.750380+10:00 DEV3 kernel: [722618.599156] fwbr100i0: port 1(fwln100i0) entered forwarding state
2024-04-05T17:40:37.774300+10:00 DEV3 kernel: [722618.623041] fwbr100i0: port 2(tap100i0) entered blocking state
2024-04-05T17:40:37.774325+10:00 DEV3 kernel: [722618.623052] fwbr100i0: port 2(tap100i0) entered disabled state
2024-04-05T17:40:37.774329+10:00 DEV3 kernel: [722618.623110] tap100i0: entered allmulticast mode
2024-04-05T17:40:37.774332+10:00 DEV3 kernel: [722618.623298] fwbr100i0: port 2(tap100i0) entered blocking state
2024-04-05T17:40:37.774335+10:00 DEV3 kernel: [722618.623303] fwbr100i0: port 2(tap100i0) entered forwarding state
2024-04-05T17:40:37.887943+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
2024-04-05T17:40:37.888181+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 00000000-0000-0000-0000-000000000100 GPU >
2024-04-05T17:40:37.888703+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=63
2024-04-05T17:40:37.893382+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_env_log: Successfully updated env symbols!
2024-04-05T17:40:37.898304+10:00 DEV3 kernel: [722618.746481] NVRM: Software scheduler timeslice set to 2083uS.
2024-04-05T17:40:37.898499+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): gpu-pci-id : 0x8300
2024-04-05T17:40:37.898613+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): vgpu_type : Quadro
2024-04-05T17:40:37.898705+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Framebuffer: 0x38000000
2024-04-05T17:40:37.898803+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1bb3:0x1204
2024-04-05T17:40:37.898877+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): FRL Value: 60 FPS
2024-04-05T17:40:37.898960+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: ######## vGPU Manager Information: ########
2024-04-05T17:40:37.899046+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: Driver Version: 535.104.06
2024-04-05T17:40:37.899131+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Detected ECC enabled on physical GPU.
2024-04-05T17:40:37.899218+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Guest usable FB size is reduced due to ECC.
2024-04-05T17:40:37.902992+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0x120001)
2024-04-05T17:40:37.948013+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
2024-04-05T17:40:38.022556+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): vGPU migration enabled
2024-04-05T17:40:38.058949+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): vGPU manager is running in non-SRIOV mode.
2024-04-05T17:40:38.064635+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: display_init inst: 0 successful
2024-04-05T17:40:38.074290+10:00 DEV3 kernel: [722618.922608] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration enabled with upstream V2 migration protocol
2024-04-05T17:40:38.314880+10:00 DEV3 qm[2866488]: <root@pam> end task UPID:DEV3:002BBD39:044EA2E9:660FAAF3:qmstart:100:root@pam: OK
2024-04-05T17:40:38.363846+10:00 DEV3 systemd[1]: session-323.scope: Deactivated successfully.
2024-04-05T17:40:38.364013+10:00 DEV3 systemd[1]: session-323.scope: Consumed 1.809s CPU time.
2024-04-05T17:40:38.922660+10:00 DEV3 systemd[1]: Started session-324.scope - Session 324 of User root.
2024-04-05T17:40:40.375557+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_env_log: (0x0): Plugin migration stage change none -> resume. QEMU migration state: RESUME
2024-04-05T17:40:55.837677+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Start restoring vGPU state ...
2024-04-05T17:40:55.837982+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: Source host driver version: 535.104.06
2024-04-05T17:40:55.958828+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: Assertion Failed at 0x97f0d3c2:2372
2024-04-05T17:40:55.959401+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: 8 frames returned by backtrace
2024-04-05T17:40:55.959548+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv009111vgpu+0x35) [0x7f0397ec5675]
2024-04-05T17:40:55.959666+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv011623vgpu+0x180) [0x7f0397ed4c10]
2024-04-05T17:40:55.959758+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv011633vgpu+0x3e2) [0x7f0397f0d3c2]
2024-04-05T17:40:55.959851+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv004320vgpu+0xcf) [0x7f0397ed6b6f]
2024-04-05T17:40:55.959942+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv011566vgpu+0x4b) [0x7f0397e69adb]
2024-04-05T17:40:55.960033+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: vgpu(+0x167e1) [0x563932c167e1]
2024-04-05T17:40:55.960127+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(+0x89044) [0x7f03984d7044]
2024-04-05T17:40:55.960219+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(+0x10961c) [0x7f039855761c]
2024-04-05T17:40:55.960312+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Migration Ended
2024-04-05T17:40:55.960408+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_env_log: (0x0): Failed to write device buffer err: 0x1f
2024-04-05T17:40:55.962296+10:00 DEV3 kernel: [722636.810144] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to write device data -5
2024-04-05T17:40:55.963109+10:00 DEV3 QEMU[2866502]: kvm: error while loading state section id 59(0000:00:10.0/vfio)
2024-04-05T17:40:55.963620+10:00 DEV3 QEMU[2866502]: kvm: load of migration failed: Input/output error
2024-04-05T17:40:56.422270+10:00 DEV3 kernel: [722637.271571] tap100i0: left allmulticast mode
2024-04-05T17:40:56.422302+10:00 DEV3 kernel: [722637.271649] fwbr100i0: port 2(tap100i0) entered disabled state
2024-04-05T17:40:56.458295+10:00 DEV3 kernel: [722637.307397] fwbr100i0: port 1(fwln100i0) entered disabled state
2024-04-05T17:40:56.458317+10:00 DEV3 kernel: [722637.307595] vmbr0: port 1(fwpr100p0) entered disabled state
2024-04-05T17:40:56.458334+10:00 DEV3 kernel: [722637.308154] fwln100i0 (unregistering): left allmulticast mode
2024-04-05T17:40:56.458339+10:00 DEV3 kernel: [722637.308161] fwln100i0 (unregistering): left promiscuous mode
2024-04-05T17:40:56.458343+10:00 DEV3 kernel: [722637.308167] fwbr100i0: port 1(fwln100i0) entered disabled state
2024-04-05T17:40:56.502301+10:00 DEV3 kernel: [722637.348766] fwpr100p0 (unregistering): left allmulticast mode
2024-04-05T17:40:56.502322+10:00 DEV3 kernel: [722637.348774] fwpr100p0 (unregistering): left promiscuous mode
2024-04-05T17:40:56.502328+10:00 DEV3 kernel: [722637.348778] vmbr0: port 1(fwpr100p0) entered disabled state
2024-04-05T17:40:56.618972+10:00 DEV3 systemd[1]: Started session-325.scope - Session 325 of User root.
2024-04-05T17:40:56.856328+10:00 DEV3 systemd[1]: 100.scope: Deactivated successfully.
2024-04-05T17:40:56.856645+10:00 DEV3 systemd[1]: 100.scope: Consumed 6.240s CPU time.
2024-04-05T17:40:58.147517+10:00 DEV3 qm[2866793]: <root@pam> starting task UPID:DEV3:002BBE81:044EABAC:660FAB0A:qmstop:100:root@pam:
2024-04-05T17:40:58.148090+10:00 DEV3 qm[2866817]: stop VM 100: UPID:DEV3:002BBE81:044EABAC:660FAB0A:qmstop:100:root@pam:
2024-04-05T17:41:08.170305+10:00 DEV3 kernel: [722649.017065] nvidia-vgpu-vfio 00000000-0000-0000-0000-000000000100: Removing from iommu group 97
2024-04-05T17:41:08.184098+10:00 DEV3 qm[2866793]: <root@pam> end task UPID:DEV3:002BBE81:044EABAC:660FAB0A:qmstop:100:root@pam: OK
2024-04-05T17:41:08.228716+10:00 DEV3 systemd[1]: session-325.scope: Deactivated successfully.
2024-04-05T17:41:08.228908+10:00 DEV3 systemd[1]: session-325.scope: Consumed 1.556s CPU time.
2024-04-05T17:41:08.266559+10:00 DEV3 systemd[1]: session-324.scope: Deactivated successfully.
2024-04-05T17:41:08.266900+10:00 DEV3 systemd[1]: session-324.scope: Consumed 1.380s CPU time.
2024-04-05T17:41:09.255200+10:00 DEV3 pmxcfs[1723]: [status] notice: received log
2024-04-05T17:41:18.375265+10:00 DEV3 systemd[1]: Stopping user@0.service - User Manager for UID 0...
2024-04-05T17:41:18.376578+10:00 DEV3 systemd[2866427]: Activating special unit exit.target...
2024-04-05T17:41:18.376778+10:00 DEV3 systemd[2866427]: Stopped target default.target - Main User Target.
2024-04-05T17:41:18.376952+10:00 DEV3 systemd[2866427]: Stopped target basic.target - Basic System.
2024-04-05T17:41:18.377081+10:00 DEV3 systemd[2866427]: Stopped target paths.target - Paths.
2024-04-05T17:41:18.377207+10:00 DEV3 systemd[2866427]: Stopped target sockets.target - Sockets.
2024-04-05T17:41:18.377330+10:00 DEV3 systemd[2866427]: Stopped target timers.target - Timers.
2024-04-05T17:41:18.377466+10:00 DEV3 systemd[2866427]: Closed dbus.socket - D-Bus User Message Bus Socket.
2024-04-05T17:41:18.377591+10:00 DEV3 systemd[2866427]: Closed dirmngr.socket - GnuPG network certificate management daemon.
2024-04-05T17:41:18.377721+10:00 DEV3 systemd[2866427]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
2024-04-05T17:41:18.377930+10:00 DEV3 systemd[2866427]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
2024-04-05T17:41:18.378124+10:00 DEV3 systemd[2866427]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
2024-04-05T17:41:18.378375+10:00 DEV3 systemd[2866427]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
2024-04-05T17:41:18.378912+10:00 DEV3 systemd[2866427]: Removed slice app.slice - User Application Slice.
2024-04-05T17:41:18.379089+10:00 DEV3 systemd[2866427]: Reached target shutdown.target - Shutdown.
2024-04-05T17:41:18.379239+10:00 DEV3 systemd[2866427]: Finished systemd-exit.service - Exit the Session.
 
2024-04-05T17:40:55.837677+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Start restoring vGPU state ...
2024-04-05T17:40:55.837982+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: Source host driver version: 535.104.06
2024-04-05T17:40:55.958828+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: Assertion Failed at 0x97f0d3c2:2372
2024-04-05T17:40:55.959401+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: 8 frames returned by backtrace
2024-04-05T17:40:55.959548+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv009111vgpu+0x35) [0x7f0397ec5675]
2024-04-05T17:40:55.959666+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv011623vgpu+0x180) [0x7f0397ed4c10]
2024-04-05T17:40:55.959758+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv011633vgpu+0x3e2) [0x7f0397f0d3c2]
2024-04-05T17:40:55.959851+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv004320vgpu+0xcf) [0x7f0397ed6b6f]
2024-04-05T17:40:55.959942+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv011566vgpu+0x4b) [0x7f0397e69adb]
2024-04-05T17:40:55.960033+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: vgpu(+0x167e1) [0x563932c167e1]
2024-04-05T17:40:55.960127+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(+0x89044) [0x7f03984d7044]
2024-04-05T17:40:55.960219+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(+0x10961c) [0x7f039855761c]
2024-04-05T17:40:55.960312+10:00 DEV3 nvidia-vgpu-mgr[2866699]: notice: vmiop_log: (0x0): Migration Ended
2024-04-05T17:40:55.960408+10:00 DEV3 nvidia-vgpu-mgr[2866699]: error: vmiop_env_log: (0x0): Failed to write device buffer err: 0x1f
2024-04-05T17:40:55.962296+10:00 DEV3 kernel: [722636.810144] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to write device data -5
2024-04-05T17:40:55.963109+10:00 DEV3 QEMU[2866502]: kvm: error while loading state section id 59(0000:00:10.0/vfio)
2024-04-05T17:40:55.963620+10:00 DEV3 QEMU[2866502]: kvm: load of migration failed: Input/output error
this sounds like the error is coming from inside the nvidia driver... it reads like it cannot write the memory into the device buffer?
 
this sounds like the error is coming from inside the nvidia driver... it reads like it cannot write the memory into the device buffer?
Thank you! You were 100% on the money with that. I had foolishly disabled ECC mode for the GPU in the source host, but didn't realise I hadn't on the destination host!

Per Nvidia's documentation below, the ECC mode of the source and destination GPU must be the same, otherwise migration fails with these errors in the syslog:
vmiop_env_log: (0x0): Failed to write device buffer err: 0x1f
[nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to write device data -5

Heading: 6.3. Migrating a VM Configured with vGPU
https://docs.nvidia.com/grid/16.0/grid-vgpu-user-guide/index.html#migrating-vm-with-vgpu
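
In case anyone else hits this, a quick way to compare (and if needed align) the ECC mode on both hosts:
Code:
# compare the current ECC mode on the source and destination hosts
nvidia-smi --query-gpu=index,name,ecc.mode.current --format=csv
# disable (0) or enable (1) ECC on a GPU; takes effect after a GPU reset/reboot
# nvidia-smi -i 0 -e 0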
 
