Nvidia A6000 vGPU 14.1 Proxmox 7.2.7 - "NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver"

Krony

New Member
Jul 21, 2022
Hello,

My issue is that I cannot see my vGPUs in nvidia-smi. I can see them in the Proxmox GUI and add them to my VM config, but the VM then fails to boot with "TASK ERROR: pci device '0000:01:00.4' has no available instances of 'nvidia-528'".

Hardware:
  - CPU(s): 32 x AMD Ryzen Threadripper PRO 3955WX 16-Cores (1 Socket)
  - Kernel Version: Linux 5.15.39-1-pve #1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)
  - PVE Manager Version: pve-manager/7.2-7/d0dd0e85
  - GPU: Nvidia A6000

My first question is: should I be using the Linux KVM or the Ubuntu version of the Nvidia vGPU 14.1 installer set? I've tried both and had similar results with each. I've strictly followed the Proxmox vGPU docs (minimal) and the Nvidia GRID 14.1 docs. All BIOS options should* be good; the hardware is a bit new (Supermicro AS-2114GT-DNR).

As I understand it, with this hardware I should be using mediated devices (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_virtual_machines_settings) and Proxmox then handles SR-IOV for the VMs?

After a fresh install of Proxmox and then:

  1. Set up non subscription repositories https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_package_repositories
  2. apt-get update
  3. apt-get dist-upgrade
  4. reboot
  5. apt install build-essential
  6. apt install pve-headers
  7. echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
  8. apt install libvirt-daemon-system
  9. reboot
  10. apt install unzip
  11. Upload nvidia drivers to Proxmox host - scp NVIDIA-GRID-Ubuntu-KVM-510.73.06-510.73.08-512.78.zip root@10.1.2.30:/root
  12. unzip NVIDIA-GRID-Ubuntu-KVM-510.73.06-510.73.08-512.78.zip
  13. sudo apt install ./nvidia-vgpu-ubuntu-510_510.73.06_amd64.deb
  14. /usr/lib/nvidia/sriov-manage -e 0000:01:00.0
  15. cd /sys/class/mdev_bus/0000\:01\:00.4/mdev_supported_types
  16. echo "37a54373-4813-443e-9261-5c0a05ede1ab"> nvidia-528/create
  17. reboot
In the output below you can see that the nvidia services are running, mdevctl can see the nvidia-528 vGPU (defined), so can the kernel, all the nvidia modules are loaded, but nvidia-smi cannot see it, and after all that: "NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver".
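
For reference, a quick way to check what that probe message is complaining about is to look at which driver the PF and the VF are currently bound to (a diagnostic sketch, assuming the PF is at 0000:01:00.0 and the VF at 0000:01:00.4):

Code:
# show the kernel driver currently in use for the PF and the VF
lspci -k -s 01:00.0
lspci -k -s 01:00.4

# or read the driver symlinks straight from sysfs
ls -l /sys/bus/pci/devices/0000:01:00.0/driver
ls -l /sys/bus/pci/devices/0000:01:00.4/driver

If the PF does not show "Kernel driver in use: nvidia", the VF probe aborts exactly as in the dmesg output below.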

Anyone got any ideas? I'm fresh out. Someone please tell me I'm missing something stooooopid.

TIA


Code:
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# systemctl status nvidia-vgpud.service 
● nvidia-vgpud.service - NVIDIA vGPU Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpud.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2022-07-21 01:26:30 BST; 6min ago
    Process: 3687 ExecStart=/usr/bin/nvidia-vgpud (code=exited, status=0/SUCCESS)
    Process: 3689 ExecStopPost=/bin/rm -rf /var/run/nvidia-vgpud (code=exited, status=0/SUCCESS)
   Main PID: 3688 (code=exited, status=0/SUCCESS)
        CPU: 103ms

Jul 21 01:26:30 pve nvidia-vgpud[3688]: Number of Displays: 1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Max pixels: 8847360
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Display: width 4096, height 2160
Jul 21 01:26:30 pve nvidia-vgpud[3688]: GPU Direct supported: 0x1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: NVLink P2P supported: 0x1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: License: NVIDIA-vComputeServer,9.0;Quadro-Virtual-DWS,5.0
Jul 21 01:26:30 pve nvidia-vgpud[3688]: PID file unlocked.
Jul 21 01:26:30 pve nvidia-vgpud[3688]: PID file closed.
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Shutdown (3688)
Jul 21 01:26:30 pve systemd[1]: nvidia-vgpud.service: Succeeded.
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# systemctl status nvidia-vgpu-mgr.service 
● nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-07-21 01:12:42 BST; 20min ago
    Process: 1006 ExecStart=/usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
   Main PID: 1010 (nvidia-vgpu-mgr)
      Tasks: 1 (limit: 154345)
     Memory: 532.0K
        CPU: 2.430s
     CGroup: /system.slice/nvidia-vgpu-mgr.service
             └─1010 /usr/bin/nvidia-vgpu-mgr

Jul 21 01:12:42 pve systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Jul 21 01:12:42 pve systemd[1]: Started NVIDIA vGPU Manager Daemon.
Jul 21 01:12:43 pve nvidia-vgpu-mgr[1010]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# mdevctl list
37a54373-4813-443e-9261-5c0a05ede1ab 0000:01:00.4 nvidia-528 (defined)
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# ls -l /sys/bus/mdev/devices/
total 0
lrwxrwxrwx 1 root root 0 Jul 21 01:15 37a54373-4813-443e-9261-5c0a05ede1ab -> ../../../devices/pci0000:00/0000:00:01.3/0000:01:00.4/37a54373-4813-443e-9261-5c0a05ede1ab
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# nvidia-smi 
Thu Jul 21 01:36:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06    Driver Version: 510.73.06    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:01:00.0 Off |                    0 |
| 30%   28C    P8    26W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# nvidia-smi vgpu
Thu Jul 21 01:36:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06              Driver Version: 510.73.06                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A6000           | 00000000:01:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# lsmod | grep nvidia
nvidia_vgpu_vfio       61440  0
nvidia              39124992  11
mdev                   28672  1 nvidia_vgpu_vfio
vfio                   40960  3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
drm                   602112  7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# dmesg | grep -E "NVRM|nvidia"
[    4.031106] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[    4.033842] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[    4.118372] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.73.06  Mon May  9 08:06:24 UTC 2022
[    5.311479] audit: type=1400 audit(1658362362.052:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=992 comm="apparmor_parser"
[    5.311482] audit: type=1400 audit(1658362362.052:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=992 comm="apparmor_parser"
[    5.325582] NVRM: GPU at 0000:01:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
[  122.485737] NVRM: GPU 0000:01:00.0: UnbindLock acquired
[  123.206289] NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver.
[  123.206291] nvidia: probe of 0000:01:00.4 failed with error -1
 
I don't see where the actual problem is?
What would you expect differently from nvidia-smi? (AFAIK it always only shows the physical card for non-MIG type vGPUs.)

If you want to use them in a VM, I'd not create the mdev manually, but let PVE handle that
(just select the VF and select the mdev profile in the GUI).

Does that work?
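
The CLI equivalent would be something along these lines, assuming VF 0000:01:00.4 and the nvidia-528 profile:

Code:
# attach the VF as a mediated device; PVE creates/removes the mdev on VM start/stop
qm set <vmid> -hostpci0 0000:01:00.4,mdev=nvidia-528,pcie=1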
 
Hey @dcsapak, thanks for taking the time.
After a reboot, mdevctl list is blank. I had to re-enable the VFs with:
Code:
root@pve:~# /usr/lib/nvidia/sriov-manage -e 00:01:0000.0
mdevctl list still blank.
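
Side note: to avoid having to re-run sriov-manage after every reboot, a one-shot systemd unit along these lines should do it (an untested sketch, assuming the PF is at 0000:01:00.0 and a made-up unit name):

Code:
# /etc/systemd/system/nvidia-sriov.service
[Unit]
Description=Enable NVIDIA SR-IOV VFs
After=nvidia-vgpud.service nvidia-vgpu-mgr.service
Before=pve-guests.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e 0000:01:00.0

[Install]
WantedBy=multi-user.target

Then systemctl daemon-reload && systemctl enable nvidia-sriov.service.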
Then I checked the status of nvidia-vgpud.service and nvidia-vgpu-mgr.service, all good.
Checked the VM had nvidia-528 defined in the GUI.

When I started the VM, it actually got a bit further, output below:

swtpm_setup: Starting vTPM manufacturing as root:root @ Thu 21 Jul 2022 09:29:29 AM BST
swtpm_setup: TPM is listening on Unix socket.
swtpm_setup: Successfully created RSA 2048 EK with handle 0x81010001.
swtpm_setup: Invoking /usr/bin/swtpm_localca --type ek --ek 9e5bc03da45fc82a138949a1643a5510745c39590f26e28d23241fdaa514a723ccdefa220b5ff8d881742a97316f199c5a7b05ac7774af143a2e034f7843d1fb90598c6dc8db9dd7004fcd667740ad686b401661ce13451ead3dd1433ae12a97f97a53c4efafa63e08a78fd90cc8fa8c80467fb768c50914b42c17d9bf89b0da4283851831b712528dc9ed60adf31078696b69f04ecbd66d5270c2fba27167d03605ad62edf6d220f20c76359c703445fb32ec6740f41a67850dcba832752097cee6c32bd0e0f391fc3b1a255788f309c6269f5343700c8434dabfbd922e8a71185f49472e921ca108e538a05c77027e17a286e34fd1d13aeb2828f143ce03e9 --dir /tmp/swtpm_setup.certs.OPQEP1 --tpm-spec-family 2.0 --tpm-spec-level 0 --tpm-spec-revision 164 --tpm-manufacturer id:00001014 --tpm-model swtpm --tpm-version id:20191023 --tpm2 --configfile /etc/swtpm-localca.conf --optsfile /etc/swtpm-localca.options
swtpm_setup: swtpm_localca: Creating root CA and a local CA's signing key and issuer cert.
swtpm_setup: swtpm_localca: Successfully created EK certificate locally.
swtpm_setup: Invoking /usr/bin/swtpm_localca --type platform --ek 9e5bc03da45fc82a138949a1643a5510745c39590f26e28d23241fdaa514a723ccdefa220b5ff8d881742a97316f199c5a7b05ac7774af143a2e034f7843d1fb90598c6dc8db9dd7004fcd667740ad686b401661ce13451ead3dd1433ae12a97f97a53c4efafa63e08a78fd90cc8fa8c80467fb768c50914b42c17d9bf89b0da4283851831b712528dc9ed60adf31078696b69f04ecbd66d5270c2fba27167d03605ad62edf6d220f20c76359c703445fb32ec6740f41a67850dcba832752097cee6c32bd0e0f391fc3b1a255788f309c6269f5343700c8434dabfbd922e8a71185f49472e921ca108e538a05c77027e17a286e34fd1d13aeb2828f143ce03e9 --dir /tmp/swtpm_setup.certs.OPQEP1 --tpm-spec-family 2.0 --tpm-spec-level 0 --tpm-spec-revision 164 --tpm-manufacturer id:00001014 --tpm-model swtpm --tpm-version id:20191023 --tpm2 --configfile /etc/swtpm-localca.conf --optsfile /etc/swtpm-localca.options
swtpm_setup: swtpm_localca: Successfully created platform certificate locally.
swtpm_setup: Successfully created NVRAM area 0x1c00002 for RSA 2048 EK certificate.
swtpm_setup: Successfully created NVRAM area 0x1c08000 for platform certificate.
swtpm_setup: Successfully created ECC EK with handle 0x81010016.
swtpm_setup: Invoking /usr/bin/swtpm_localca --type ek --ek x=9af345d35a5918c6b6e8a1a194b97b0893fe932b68e8684f3bacb84c547911e85f3c18f7f7f615b97d805b32ec5f6795,y=03c59b1fdd3bb1f6b85f05125a4c2431d754525c3516fb00aeebad64993d5dc2f98e0dfb86d01a29c1fefd2264f3b8f0,id=secp384r1 --dir /tmp/swtpm_setup.certs.OPQEP1 --tpm-spec-family 2.0 --tpm-spec-level 0 --tpm-spec-revision 164 --tpm-manufacturer id:00001014 --tpm-model swtpm --tpm-version id:20191023 --tpm2 --configfile /etc/swtpm-localca.conf --optsfile /etc/swtpm-localca.options
swtpm_setup: swtpm_localca: Successfully created EK certificate locally.
swtpm_setup: Successfully created NVRAM area 0x1c00016 for ECC EK certificate.
swtpm_setup: Successfully activated PCR banks sha256 among sha1,sha256,sha384,sha512.
swtpm_setup: Successfully authored TPM state.
swtpm_setup: Ending vTPM manufacturing @ Thu 21 Jul 2022 09:29:30 AM BST
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to setup container for group 65: Failed to set iommu for container: Invalid argument
stopping swtpm instance (pid 2216) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1
 
After another reboot, without enabling the VFs first, I got another error when booting the VM. (Win10 21H2, not installed yet; 1 socket, 16 cores, 32 GB RAM, pc-q35-6.2)

mdev instance '00000000-0000-0000-0000-000000000100' already existed, using it.
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to setup container for group 65: Failed to set iommu for container: Invalid argument
stopping swtpm instance (pid 4756) due to QEMU startup error

TASK ERROR: start failed: QEMU exited with code 1
 
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to setup container for group 65: Failed to set iommu for container: Invalid argument
still the same error
 
This is the relevant error. Check if you enabled AER in the BIOS, see https://enterprise-support.nvidia.c...s-BIOS-Settings-for-vGPUs-that-Support-SR-IOV
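
You can also check from the host whether the PF actually exposes the AER capability (a quick diagnostic sketch, assuming the PF sits at 01:00.0):

Code:
# the capability shows up in the extended capability list if AER is exposed
lspci -vvv -s 01:00.0 | grep -i "Advanced Error Reporting"

No output means the capability is not being exposed.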
No dice. You were correct though, both those settings are now enabled in the BIOS. Had to re-enable the VFs, then got:
"swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to get region 1 info: Input/output error
stopping swtpm instance (pid 2725) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1"
 
Anything in dmesg when you're trying to start the VM?
Also, can you please post the complete VM config (qm config ID)?
 
dmesg contains some of this:

[ 61.298279] NVRM: GPU at 0000:01:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
[ 62.511254] NVRM: GPU 0000:01:00.0: UnbindLock acquired

[ 63.234939] NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver.
[ 63.234941] nvidia: probe of 0000:01:00.4 failed with error -1

[ 231.327292] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled

[ 533.537360] NVRM: 00000000-0000-0000-0000-000000000100 Failed to get bar info: status: 0x57 region_index: 1
[ 533.537364] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to query region info for region 1. ret: -5
[ 533.537396] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: VFIO IOCTL VFIO_DEVICE_GET_REGION_INFO failed. cmd: 0x3b6c ret: -5


[ 533.785130] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to post VM shutdown event.
[ 533.785293] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to unregister notifier.
[ 533.893346] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: stop failed. status: 0x56


Code:
root@pve:~# qm config 100
bios: ovmf
boot: order=ide0;ide2;net0;ide1
cores: 16
description: args%3A -uuid 00000000-0000-0000-0000-000000000100
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00.4,mdev=nvidia-528,pcie=1,x-vga=1
ide0: local-lvm:vm-100-disk-1,size=150G
ide1: local:iso/virtio-win.iso,media=cdrom,size=519172K
ide2: local:iso/Win10_21H2_EnglishInternational_x64.iso,media=cdrom,size=5748118K
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1658362880
name: WIN10
net0: e1000=0A:07:7F:DE:F1:B2,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=467fbe0b-3fcc-434f-ac9d-292d4f66f6f4
sockets: 1
tpmstate0: local-lvm:vm-100-disk-2,size=4M,version=v2.0
vmgenid: ca3ce7e5-4fe5-4b43-99ef-361e6ad792d0
 
Mhmm... the error messages do not really help (I can't find anything)...
Can you post the complete dmesg output? Maybe there's some other hint that can help further.
 
Hi, make sure that the 'args: -uuid <UUID>' is there, and start it again.
 
Hi, make sure that the 'args: -uuid <UUID>' is there, and start it again.
OK, some progress. That seems to have done the trick and I could boot my Win 10 VM and install Windows. But my next issue is that after installing the Nvidia 512.78_grid_win10_win11_server2016_server2019_server2022_64bit_international.exe driver, it bricked the Windows install and it goes into recovery :( After letting Windows remove the driver, rebooting and having another go, the VM now boots, but I get a black screen and I notice RAM usage is 30 of 32 GB. I cloned it after re-installing and before adding the nvidia driver, added a vGPU on the next VF, and now I do see 2 vGPUs in nvidia-smi... but I have managed to brick that one too.

I also have to reboot the whole chassis, as shutting down in Proxmox just hangs. VFs need re-enabling after a reboot too. qm configs below.

Is the args: -uuid 00000000-0000-0000-0000-000000000100 maybe confusing the two VMs?


Code:
root@pve:~# qm config 100
agent: 1
args: -uuid 00000000-0000-0000-0000-000000000100
bios: ovmf
boot: order=ide0;net0;ide2
cores: 8
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00.4,mdev=nvidia-528,pcie=1
ide0: local-lvm:vm-100-disk-1,size=150G
ide2: local:iso/virtio-win.iso,media=cdrom,size=519172K
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1658686150
name: WIN10
net0: e1000=C2:BE:C1:BC:81:38,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=cf0b3cf1-a8fa-4ac2-81d0-9b3acc927c7e
sockets: 1
vga: virtio
vmgenid: ad174d01-237c-456e-80f8-2b1a8209d73b

root@pve:~# qm config 101
agent: 1
args: -uuid 00000000-0000-0000-0000-000000000100
bios: ovmf
boot: order=ide0;net0;ide2
cores: 4
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00.5,mdev=nvidia-528,pcie=1,x-vga=1
ide0: local-lvm:vm-101-disk-1,size=150G
ide2: local:iso/virtio-win.iso,media=cdrom,size=519172K
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1658686150
name: WIN10CLONE
net0: e1000=66:2F:5B:35:60:09,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=ee32f875-889f-405c-a938-ff6881779777
sockets: 1
vga: virtio
vmgenid: 94f69e36-8aff-4614-b593-e98d140c8f11
 
Is the args: -uuid 00000000-0000-0000-0000-000000000100 maybe confusing the two VMs?
Yeah, you have to add the correct UUID. Namely, we generate them from the VMID and the hostpci index; basically we do

<hostpci-index-padded-to-8-chars>-0000-0000-0000-<vmid-padded-to-12-chars>

so if the mdev is on hostpci1 and vmid 234, the UUID for that vGPU is:
00000001-0000-0000-0000-000000000234

We'll improve that so that we automatically add the UUID in the case of vGPU passthrough on NVIDIA.
(I now have an RTX A5000 here to test, so we can improve the usage of that more easily.)
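
For illustration, the scheme boils down to this (make_vgpu_uuid is just a hypothetical helper name, not part of PVE):

Code:
# build the vGPU UUID PVE expects from the hostpci index and the VMID
make_vgpu_uuid() {
    local hostpci_index=$1 vmid=$2
    printf '%08d-0000-0000-0000-%012d\n' "$hostpci_index" "$vmid"
}

make_vgpu_uuid 0 100   # -> 00000000-0000-0000-0000-000000000100
make_vgpu_uuid 1 234   # -> 00000001-0000-0000-0000-000000000234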
 
So, like:

args: -uuid 00000001-0000-0000-0000-000000000100 for 100.conf
args: -uuid 00000001-0000-0000-0000-000000000101 for 101.conf
args: -uuid 00000001-0000-0000-0000-000000000102 for 102.conf
etc?

Still no joy, "PAGE FAULT IN NONPAGED AREA" on the VMs.

root@pve:/etc/pve/local/qemu-server# lspci | grep NVID
01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)

(Screenshot attached: Screenshot 2022-07-25 at 15.42.57.png)
 
So, like:

args: -uuid 00000001-0000-0000-0000-000000000100 for 100.conf
args: -uuid 00000001-0000-0000-0000-000000000101 for 101.conf
args: -uuid 00000001-0000-0000-0000-000000000102 for 102.conf
etc?
Not exactly. As I see you use hostpci0, in that case the UUIDs must be
Code:
00000000-0000-0000-0000-000000000100
00000000-0000-0000-0000-000000000101
00000000-0000-0000-0000-000000000102
for VM 100, 101, 102 respectively.

root@pve:/etc/pve/local/qemu-server# lspci | grep NVID
01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
Is that the complete output?
Can you post an unedited 'lspci -nn'?

I have an RTX A5000 here and I have successfully tested Windows VMs with vGPUs...
(We'll post a wiki article in the near future.)

Anything in the dmesg/journal while the VM bluescreens?
 
Glad to hear it's working. I didn't think the A5000 supported vGPU? Did you use the Linux KVM Nvidia GRID bundle (with the included Windows display driver), or the Ubuntu version?

dmesg is now showing "[ 82.772848] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled" for each BSOD'ing VM.

lspci -nn and dmesg output attached. Thanks again
 

FYI: after installing the Linux KVM GRID driver on the host (after purging the Ubuntu install) and then the matching display driver in the Win10 VM, same deal. BSOD and the dmesg below.

[ 91.077566] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled
 
