Nvidia A6000 vGPU 14.1 Proxmox 7.2.7 NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver

Krony · Jul 21, 2022

Hello,

My issue is that I cannot see my vGPUs in nvidia-smi. I can see them in the Proxmox GUI and add them to my VM config, but the VM then fails to boot with "TASK ERROR: pci device '0000:01:00.4' has no available instances of 'nvidia-528'".

Hardware:

CPU(s) 32 x AMD Ryzen Threadripper PRO 3955WX 16-Cores (1 Socket)

Kernel Version
Linux 5.15.39-1-pve #1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)

PVE Manager Version
pve-manager/7.2-7/d0dd0e85

Nvidia A6000

My first question is: should I be using the Linux KVM or the Ubuntu version of the Nvidia vGPU 14.1 installer set? I've tried both, with similar results. I've strictly followed the Proxmox vGPU docs (minimal) and the Nvidia Grid 14.1 docs. All BIOS options should* be good; the hardware is a bit new (Supermicro AS-2114GT-DNR).

As I understand it, with this hardware I should be using mediated devices (https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_virtual_machines_settings), and Proxmox then handles SR-IOV for the VMs?

After a fresh install of Proxmox and then:

Set up non subscription repositories https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysadmin_package_repositories
apt-get update
apt-get dist-upgrade
reboot
apt install build-essential
apt install pve-headers
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
apt install libvirt-daemon-system
reboot
apt install unzip
Upload nvidia drivers to Proxmox host - scp NVIDIA-GRID-Ubuntu-KVM-510.73.06-510.73.08-512.78.zip root@10.1.2.30:/root
unzip NVIDIA-GRID-Ubuntu-KVM-510.73.06-510.73.08-512.78.zip
sudo apt install ./nvidia-vgpu-ubuntu-510_510.73.06_amd64.deb
/usr/lib/nvidia/sriov-manage -e 00:01:0000.0
cd /sys/class/mdev_bus/0000\:01\:00.4/mdev_supported_types
echo "37a54373-4813-443e-9261-5c0a05ede1ab"> nvidia-528/create
reboot

In the output below you can see that the nvidia services are running, mdevctl can see the nvidia-528 (defined) vGPU and so can the kernel, and all the nvidia modules are loaded. nvidia-smi cannot see the vGPU, though, and after all that I get "NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver".

Anyone got any ideas? I'm fresh out. Someone please tell me I'm missing something stooooopid.

TIA

Code:

root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# systemctl status nvidia-vgpud.service 
● nvidia-vgpud.service - NVIDIA vGPU Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpud.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Thu 2022-07-21 01:26:30 BST; 6min ago
    Process: 3687 ExecStart=/usr/bin/nvidia-vgpud (code=exited, status=0/SUCCESS)
    Process: 3689 ExecStopPost=/bin/rm -rf /var/run/nvidia-vgpud (code=exited, status=0/SUCCESS)
   Main PID: 3688 (code=exited, status=0/SUCCESS)
        CPU: 103ms

Jul 21 01:26:30 pve nvidia-vgpud[3688]: Number of Displays: 1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Max pixels: 8847360
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Display: width 4096, height 2160
Jul 21 01:26:30 pve nvidia-vgpud[3688]: GPU Direct supported: 0x1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: NVLink P2P supported: 0x1
Jul 21 01:26:30 pve nvidia-vgpud[3688]: License: NVIDIA-vComputeServer,9.0;Quadro-Virtual-DWS,5.0
Jul 21 01:26:30 pve nvidia-vgpud[3688]: PID file unlocked.
Jul 21 01:26:30 pve nvidia-vgpud[3688]: PID file closed.
Jul 21 01:26:30 pve nvidia-vgpud[3688]: Shutdown (3688)
Jul 21 01:26:30 pve systemd[1]: nvidia-vgpud.service: Succeeded.
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# systemctl status nvidia-vgpu-mgr.service 
● nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-07-21 01:12:42 BST; 20min ago
    Process: 1006 ExecStart=/usr/bin/nvidia-vgpu-mgr (code=exited, status=0/SUCCESS)
   Main PID: 1010 (nvidia-vgpu-mgr)
      Tasks: 1 (limit: 154345)
     Memory: 532.0K
        CPU: 2.430s
     CGroup: /system.slice/nvidia-vgpu-mgr.service
             └─1010 /usr/bin/nvidia-vgpu-mgr

Jul 21 01:12:42 pve systemd[1]: Starting NVIDIA vGPU Manager Daemon...
Jul 21 01:12:42 pve systemd[1]: Started NVIDIA vGPU Manager Daemon.
Jul 21 01:12:43 pve nvidia-vgpu-mgr[1010]: notice: vmiop_env_log: nvidia-vgpu-mgr daemon started
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# mdevctl list
37a54373-4813-443e-9261-5c0a05ede1ab 0000:01:00.4 nvidia-528 (defined)
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# ls -l /sys/bus/mdev/devices/
total 0
lrwxrwxrwx 1 root root 0 Jul 21 01:15 37a54373-4813-443e-9261-5c0a05ede1ab -> ../../../devices/pci0000:00/0000:00:01.3/0000:01:00.4/37a54373-4813-443e-9261-5c0a05ede1ab
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# nvidia-smi 
Thu Jul 21 01:36:20 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06    Driver Version: 510.73.06    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:01:00.0 Off |                    0 |
| 30%   28C    P8    26W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# nvidia-smi vgpu
Thu Jul 21 01:36:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.06              Driver Version: 510.73.06                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A6000           | 00000000:01:00.0             |   0%       |
+---------------------------------+------------------------------+------------+
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# lsmod | grep nvidia
nvidia_vgpu_vfio       61440  0
nvidia              39124992  11
mdev                   28672  1 nvidia_vgpu_vfio
vfio                   40960  3 nvidia_vgpu_vfio,vfio_iommu_type1,mdev
drm                   602112  7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
root@pve:/sys/class/mdev_bus/0000:01:00.4/mdev_supported_types# dmesg | grep -E "NVRM|nvidia"
[    4.031106] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[    4.033842] nvidia 0000:01:00.0: enabling device (0000 -> 0002)
[    4.118372] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.73.06  Mon May  9 08:06:24 UTC 2022
[    5.311479] audit: type=1400 audit(1658362362.052:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=992 comm="apparmor_parser"
[    5.311482] audit: type=1400 audit(1658362362.052:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=992 comm="apparmor_parser"
[    5.325582] NVRM: GPU at 0000:01:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
[  122.485737] NVRM: GPU 0000:01:00.0: UnbindLock acquired
[  123.206289] NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver.
[  123.206291] nvidia: probe of 0000:01:00.4 failed with error -1
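
Since the last dmesg lines complain that the PF is not bound to the nvidia driver, one way to cross-check is to read the binding straight from sysfs. The following is a minimal sketch, not something taken from the Nvidia or Proxmox docs: `pf_status` and the `SYSFS_ROOT` override are hypothetical names introduced here, while the `driver` symlink and the `sriov_numvfs` file are standard PCI sysfs entries. Whether the "Aborting probe" message is actually harmful on this setup is not established in this thread.

```shell
# Report which kernel driver a PCI function is bound to and how many
# SR-IOV virtual functions (VFs) are currently enabled on it.
# SYSFS_ROOT is a hypothetical override so the sketch can be tried
# against a fake directory tree; on a real host leave it unset.
pf_status() {
    pf_dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1"
    # "driver" is a symlink to the bound driver; it is absent when unbound.
    if [ -L "$pf_dir/driver" ]; then
        pf_driver=$(basename "$(readlink "$pf_dir/driver")")
    else
        pf_driver="none"
    fi
    # "sriov_numvfs" only exists on SR-IOV capable physical functions.
    if [ -r "$pf_dir/sriov_numvfs" ]; then
        pf_vfs=$(cat "$pf_dir/sriov_numvfs")
    else
        pf_vfs="n/a"
    fi
    echo "driver=$pf_driver vfs=$pf_vfs"
}
```

On the host above this would be called as `pf_status 0000:01:00.0`, using the bus ID that nvidia-smi reports for the card.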

dcsapak · Jul 21, 2022

i don't see where the actual problem is?
what would you expect differently from nvidia-smi (AFAIK it always only shows the physical card for non MIG type vgpus ?)

if you want to use them in a vm, i'd not create the mdev manually, but let pve handle that
(just select the vf and select the mdev profile in the gui)

does that work?

Krony · Jul 21, 2022

Hey @dcsapak thanks for taking the time.
After a reboot, mdevctl list is blank. I had to re-enable the VFs with:

Code:

root@pve:~# /usr/lib/nvidia/sriov-manage -e 00:01:0000.0

mdevctl list still blank.
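
Because the VFs disappear on every reboot, the sriov-manage call could be wrapped in a oneshot systemd unit so it runs at boot. This is only a sketch under assumptions: the function name `sriov_unit`, the unit file name in the usage note, and the `After=`/`Before=` ordering are choices made here, not something confirmed in this thread. The command and the address format are the ones used above.

```shell
# Print a oneshot systemd unit that re-runs sriov-manage at boot.
# $1 is the GPU address exactly as it would be passed to sriov-manage -e.
# Ordering after the nvidia daemons and before guest autostart is an
# assumption, not taken from the thread.
sriov_unit() {
    cat <<EOF
[Unit]
Description=Enable NVIDIA SR-IOV virtual functions
After=nvidia-vgpud.service nvidia-vgpu-mgr.service
Before=pve-guests.service

[Service]
Type=oneshot
ExecStart=/usr/lib/nvidia/sriov-manage -e $1

[Install]
WantedBy=multi-user.target
EOF
}
```

A possible use would be `sriov_unit 00:01:0000.0 > /etc/systemd/system/nvidia-sriov.service`, followed by `systemctl daemon-reload` and `systemctl enable nvidia-sriov.service`.
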
Then I checked the status of nvidia-vgpud.service and nvidia-vgpu-mgr.service, all good.
Checked the VM had nvidia-528 defined in the GUI.

When I started the VM, it actually got a bit further, output below:

swtpm_setup: Starting vTPM manufacturing as root:root @ Thu 21 Jul 2022 09:29:29 AM BST
swtpm_setup: TPM is listening on Unix socket.
swtpm_setup: Successfully created RSA 2048 EK with handle 0x81010001.
swtpm_setup: Invoking /usr/bin/swtpm_localca --type ek --ek 9e5bc03da45fc82a138949a1643a5510745c39590f26e28d23241fdaa514a723ccdefa220b5ff8d881742a97316f199c5a7b05ac7774af143a2e034f7843d1fb90598c6dc8db9dd7004fcd667740ad686b401661ce13451ead3dd1433ae12a97f97a53c4efafa63e08a78fd90cc8fa8c80467fb768c50914b42c17d9bf89b0da4283851831b712528dc9ed60adf31078696b69f04ecbd66d5270c2fba27167d03605ad62edf6d220f20c76359c703445fb32ec6740f41a67850dcba832752097cee6c32bd0e0f391fc3b1a255788f309c6269f5343700c8434dabfbd922e8a71185f49472e921ca108e538a05c77027e17a286e34fd1d13aeb2828f143ce03e9 --dir /tmp/swtpm_setup.certs.OPQEP1 --tpm-spec-family 2.0 --tpm-spec-level 0 --tpm-spec-revision 164 --tpm-manufacturer id:00001014 --tpm-model swtpm --tpm-version id:20191023 --tpm2 --configfile /etc/swtpm-localca.conf --optsfile /etc/swtpm-localca.options
swtpm_setup: swtpm_localca: Creating root CA and a local CA's signing key and issuer cert.
swtpm_setup: swtpm_localca: Successfully created EK certificate locally.
swtpm_setup: Invoking /usr/bin/swtpm_localca --type platform --ek 9e5bc03da45fc82a138949a1643a5510745c39590f26e28d23241fdaa514a723ccdefa220b5ff8d881742a97316f199c5a7b05ac7774af143a2e034f7843d1fb90598c6dc8db9dd7004fcd667740ad686b401661ce13451ead3dd1433ae12a97f97a53c4efafa63e08a78fd90cc8fa8c80467fb768c50914b42c17d9bf89b0da4283851831b712528dc9ed60adf31078696b69f04ecbd66d5270c2fba27167d03605ad62edf6d220f20c76359c703445fb32ec6740f41a67850dcba832752097cee6c32bd0e0f391fc3b1a255788f309c6269f5343700c8434dabfbd922e8a71185f49472e921ca108e538a05c77027e17a286e34fd1d13aeb2828f143ce03e9 --dir /tmp/swtpm_setup.certs.OPQEP1 --tpm-spec-family 2.0 --tpm-spec-level 0 --tpm-spec-revision 164 --tpm-manufacturer id:00001014 --tpm-model swtpm --tpm-version id:20191023 --tpm2 --configfile /etc/swtpm-localca.conf --optsfile /etc/swtpm-localca.options
swtpm_setup: swtpm_localca: Successfully created platform certificate locally.
swtpm_setup: Successfully created NVRAM area 0x1c00002 for RSA 2048 EK certificate.
swtpm_setup: Successfully created NVRAM area 0x1c08000 for platform certificate.
swtpm_setup: Successfully created ECC EK with handle 0x81010016.
swtpm_setup: Invoking /usr/bin/swtpm_localca --type ek --ek x=9af345d35a5918c6b6e8a1a194b97b0893fe932b68e8684f3bacb84c547911e85f3c18f7f7f615b97d805b32ec5f6795,y=03c59b1fdd3bb1f6b85f05125a4c2431d754525c3516fb00aeebad64993d5dc2f98e0dfb86d01a29c1fefd2264f3b8f0,id=secp384r1 --dir /tmp/swtpm_setup.certs.OPQEP1 --tpm-spec-family 2.0 --tpm-spec-level 0 --tpm-spec-revision 164 --tpm-manufacturer id:00001014 --tpm-model swtpm --tpm-version id:20191023 --tpm2 --configfile /etc/swtpm-localca.conf --optsfile /etc/swtpm-localca.options
swtpm_setup: swtpm_localca: Successfully created EK certificate locally.
swtpm_setup: Successfully created NVRAM area 0x1c00016 for ECC EK certificate.
swtpm_setup: Successfully activated PCR banks sha256 among sha1,sha256,sha384,sha512.
swtpm_setup: Successfully authored TPM state.
swtpm_setup: Ending vTPM manufacturing @ Thu 21 Jul 2022 09:29:30 AM BST
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to setup container for group 65: Failed to set iommu for container: Invalid argument
stopping swtpm instance (pid 2216) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1

dcsapak · Jul 21, 2022

Krony said:
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to setup container for group 65: Failed to set iommu for container: Invalid argument

this is the relevant error, check if you enabled AER in the bios, see https://enterprise-support.nvidia.c...s-BIOS-Settings-for-vGPUs-that-Support-SR-IOV

Krony · Jul 21, 2022

After another reboot, without enabling the VFs first, I got another error when booting the VM (Win10 21H2, not installed yet; 1 socket, 16 cores, 32GB RAM, pc-q35-6.2).

mdev instance '00000000-0000-0000-0000-000000000100' already existed, using it.
swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to setup container for group 65: Failed to set iommu for container: Invalid argument
stopping swtpm instance (pid 4756) due to QEMU startup error

TASK ERROR: start failed: QEMU exited with code 1

dcsapak · Jul 21, 2022

Krony said:
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to setup container for group 65: Failed to set iommu for container: Invalid argument

still the same error

Krony · Jul 21, 2022

dcsapak said:
still the same error

Yep, our posts are slightly out of sync, I hadn't seen your reply. Just having the BIOS checked on site now. Thanks again

Krony · Jul 21, 2022

dcsapak said:
this is the relevant error, check if you enabled AER in the bios, see https://enterprise-support.nvidia.c...s-BIOS-Settings-for-vGPUs-that-Support-SR-IOV

No dice. You were correct though; both those settings are now enabled in the BIOS. Had to re-enable the VFs, then got:
"swtpm_setup: Not overwriting existing state file.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.4/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10: vfio 00000000-0000-0000-0000-000000000100: failed to get region 1 info: Input/output error
stopping swtpm instance (pid 2725) due to QEMU startup error
TASK ERROR: start failed: QEMU exited with code 1"

dcsapak · Jul 21, 2022

anything in dmesg when you're trying to start the vm?
also can you please post the complete vm config (qm config ID)

Krony · Jul 21, 2022

dmesg contains some of this:

[ 61.298279] NVRM: GPU at 0000:01:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
[ 62.511254] NVRM: GPU 0000:01:00.0: UnbindLock acquired

[ 63.234939] NVRM: Aborting probe for VF 0000:01:00.4 since PF is not bound to nvidia driver.
[ 63.234941] nvidia: probe of 0000:01:00.4 failed with error -1

[ 231.327292] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled

[ 533.537360] NVRM: 00000000-0000-0000-0000-000000000100 Failed to get bar info: status: 0x57 region_index: 1
[ 533.537364] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to query region info for region 1. ret: -5
[ 533.537396] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: VFIO IOCTL VFIO_DEVICE_GET_REGION_INFO failed. cmd: 0x3b6c ret: -5

[ 533.785130] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to post VM shutdown event.
[ 533.785293] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: Failed to unregister notifier.
[ 533.893346] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: stop failed. status: 0x56

root@pve:~# qm config 100
bios: ovmf
boot: order=ide0;ide2;net0;ide1
cores: 16
description: args%3A -uuid 00000000-0000-0000-0000-000000000100
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00.4,mdev=nvidia-528,pcie=1,x-vga=1
ide0: local-lvm:vm-100-disk-1,size=150G
ide1: local:iso/virtio-win.iso,media=cdrom,size=519172K
ide2: local:iso/Win10_21H2_EnglishInternational_x64.iso,media=cdrom,size=5748118K
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1658362880
name: WIN10
net0: e1000=0A:07:7F
E:F1:B2,bridge=vmbr0,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-pci
smbios1: uuid=467fbe0b-3fcc-434f-ac9d-292d4f66f6f4
sockets: 1
tpmstate0: local-lvm:vm-100-disk-2,size=4M,version=v2.0
vmgenid: ca3ce7e5-4fe5-4b43-99ef-361e6ad792d0

Krony · Jul 21, 2022

I've also tried step 5 here but got the same error and hashed it out.

dcsapak · Jul 22, 2022

mhmm... the error messages do not really help (i can't find anything)...
can you post the complete dmesg output? maybe there's some other hint that can further help

Krony · Jul 22, 2022

Sure thing, thanks again for the help.

dcsapak · Jul 22, 2022

hi make sure that the 'args: -uuid <UUID>' is there, and start it again

Krony · Jul 25, 2022

dcsapak said:
hi make sure that the 'args: -uuid <UUID>' is there, and start it again

OK, some progress. That seems to have done the trick: I could boot my Win 10 VM and install Windows. My next issue is that installing the Nvidia 512.78_grid_win10_win11_server2016_server2019_server2022_64bit_international.exe driver bricked the Windows install, which now goes into recovery.

After letting Windows remove the driver, rebooting, and having another go, the VM now boots, but I get a black screen and I notice RAM usage is 30 of 32GB. I cloned the VM after re-installing and before adding the nvidia driver, added a vGPU on the next VF, and now I do see 2 vGPUs in nvidia-smi... but I have managed to brick that one too.

I also have to reboot the whole chassis, as shutdown in Proxmox just hangs. The VFs need re-enabling after a reboot too. qm config below.

Is maybe the args: -uuid 00000000-0000-0000-0000-000000000100 confusing the two VM's?

root@pve:~# qm config 100
agent: 1
args: -uuid 00000000-0000-0000-0000-000000000100
bios: ovmf
boot: order=ide0;net0;ide2
cores: 8
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00.4,mdev=nvidia-528,pcie=1
ide0: local-lvm:vm-100-disk-1,size=150G
ide2: local:iso/virtio-win.iso,media=cdrom,size=519172K
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1658686150
name: WIN10
net0: e1000=C2:BE:C1:BC:81:38,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=cf0b3cf1-a8fa-4ac2-81d0-9b3acc927c7e
sockets: 1
vga: virtio
vmgenid: ad174d01-237c-456e-80f8-2b1a8209d73b

root@pve:~# qm config 101
agent: 1
args: -uuid 00000000-0000-0000-0000-000000000100
bios: ovmf
boot: order=ide0;net0;ide2
cores: 4
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:01:00.5,mdev=nvidia-528,pcie=1,x-vga=1
ide0: local-lvm:vm-101-disk-1,size=150G
ide2: local:iso/virtio-win.iso,media=cdrom,size=519172K
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1658686150
name: WIN10CLONE
net0: e1000=66:2F:5B:35:60:09,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=ee32f875-889f-405c-a938-ff6881779777
sockets: 1
vga: virtio
vmgenid: 94f69e36-8aff-4614-b593-e98d140c8f11

dcsapak · Jul 25, 2022

Krony said:
Is maybe the args: -uuid 00000000-0000-0000-0000-000000000100 confusing the two VM's?

yeah you have to add the correct uuid, namely we generate them from the vmid and hostpci index, basically we do

<hostpci-index-padded-to-8-chars>-0000-0000-0000-<vmid-padded-to-12-chars>

so if the mdev is on hostpci1 and vmid 234 the uuid for that vgpu is:
00000001-0000-0000-0000-000000000234

we'll improve that so that we automatically add the uuid in the case of vgpu passthrough on nvidia
(i now have an rtx a5000 here to test, so we can improve the usage of that more easily)
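
The scheme above can be sketched in shell as follows. This is a minimal illustration, assuming the padding is zero-padding of the decimal index and VMID, as the example for hostpci1 and vmid 234 suggests; the function name `mdev_uuid` is made up for this sketch and is not a Proxmox command.

```shell
# Build the mdev UUID for a VM from the hostpci index and the VMID,
# following the padding scheme described above.
# $1 = hostpci index (e.g. 1 for hostpci1), $2 = VMID (e.g. 234)
mdev_uuid() {
    printf '%08d-0000-0000-0000-%012d\n' "$1" "$2"
}

mdev_uuid 1 234   # 00000001-0000-0000-0000-000000000234
mdev_uuid 0 100   # 00000000-0000-0000-0000-000000000100
```

The resulting value is what would go after `args: -uuid` in the VM config.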

Krony · Jul 25, 2022

So, like:

args: -uuid 00000001-0000-0000-0000-000000000100 for 100.conf
args: -uuid 00000001-0000-0000-0000-000000000101 for 101.conf
args: -uuid 00000001-0000-0000-0000-000000000102 for 102.conf
etc?

Still no joy, "page fault in nonpaged area" (PAGE_FAULT_IN_NONPAGED_AREA) BSOD on the VMs

root@pve:/etc/pve/local/qemu-server# lspci | grep NVID
01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)

dcsapak · Jul 26, 2022

Krony said:
So, like:

args: -uuid 00000001-0000-0000-0000-000000000100 for 100.conf
args: -uuid 00000001-0000-0000-0000-000000000101 for 101.conf
args: -uuid 00000001-0000-0000-0000-000000000102 for 102.conf
etc?

not exactly, as i see you use hostpci0 , in that case the uuids must be

Code:

00000000-0000-0000-0000-000000000100
00000000-0000-0000-0000-000000000101
00000000-0000-0000-0000-000000000102

for vm 100,101,102 respectively
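
To catch a mismatch like the cloned VM that kept the UUID of VM 100, the args line can be compared against the expected value. The sketch below is hypothetical: `check_vgpu_uuid` is a name invented here, it only looks at the first hostpci entry that has an mdev= option, and it assumes the zero-padded decimal scheme described earlier in the thread. It reads `qm config` output on stdin.

```shell
# Compare the "args: -uuid ..." line of a VM config with the UUID expected
# for its first mdev-backed hostpci entry. $1 is the VMID.
check_vgpu_uuid() {
    vmid=$1
    cfg=$(cat)
    # Index N of the first "hostpciN:" line that carries an mdev= option.
    idx=$(printf '%s\n' "$cfg" | sed -n 's/^hostpci\([0-9][0-9]*\):.*mdev=.*/\1/p' | head -n 1)
    # UUID currently passed to QEMU through the args line, if any.
    have=$(printf '%s\n' "$cfg" | sed -n 's/^args:.*-uuid \([0-9a-fA-F-]*\).*/\1/p' | head -n 1)
    if [ -z "$idx" ]; then
        echo "no mdev device configured"
        return 0
    fi
    want=$(printf '%08d-0000-0000-0000-%012d' "$idx" "$vmid")
    if [ "$have" = "$want" ]; then
        echo "ok $want"
    else
        echo "mismatch: expected $want, found ${have:-none}"
        return 1
    fi
}
```

Usage would be along the lines of `qm config 101 | check_vgpu_uuid 101`; a non-zero exit status signals that the args line needs correcting.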

Krony said:
root@pve:/etc/pve/local/qemu-server# lspci | grep NVID
01:00.0 3D controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)

is that the complete output?
can you post an unedited 'lspci -nn' ?

i have a RTX A5000 here and i have successfully tested windows vms with vgpus...
(we'll post a wiki article in the near future)

anything in the dmesg/journal while the vm bluescreens?

Krony · Jul 27, 2022

Glad to hear it's working, I didn't think the A5000 supports vGPU? Did you use the Linux KVM Nvidia Grid bundle (with included Windows display driver), or the Ubuntu version?

Dmesg now showing "[ 82.772848] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled" for each BSOD'ing VM

lspci -nn and dmesg output attached. Thanks again

Krony · Jul 27, 2022

FYI. After installing the Linux KVM grid bundle on the host (after purging the Ubuntu install) and then the matching display driver on the Win10 VM, same deal: BSOD and the dmesg below.

[ 91.077566] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: vGPU migration disabled
