TASK ERROR: pci device '0000:3b:00.0' has no available instances of 'nvidia-267'

MisterDeeds

Well-Known Member
Nov 11, 2021
Dear all

We run a VDI environment with GPU support on Proxmox. Since the latest NVIDIA driver update, VMs that have been shut down can no longer be started again. The following error message always appears:

TASK ERROR: pci device '0000:3b:00.0' has no available instances of 'nvidia-267'.

There should still be instances available, though.
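For reference, the number of free instances the kernel itself reports for a type can be read directly from sysfs; a quick cross-check, assuming the standard mdev sysfs layout for this card:

Code:
# Instances of nvidia-267 the kernel still considers free on this GPU
cat /sys/bus/pci/devices/0000:3b:00.0/mdev_supported_types/nvidia-267/available_instances
# mdev devices currently created for this type
ls /sys/bus/pci/devices/0000:3b:00.0/mdev_supported_types/nvidia-267/devices/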

I then installed the latest mdevctl version (1.2.0-2), unfortunately without success.

vm.conf:
Code:
root@PVE001:~# cat /etc/pve/qemu-server/104.conf
agent: 1,fstrim_cloned_disks=1
args: -uuid 00000000-0000-0000-0000-000000000104
bios: ovmf
boot: order=sata0;ide2
cores: 4
cpu: host
efidisk0: PVNAS1-Vm:104/vm-104-disk-0.qcow2,size=128K
hostpci0: 0000:3b:00.0,device-id=0x1e30,mdev=nvidia-267,sub-device-id=0x129e,sub-vendor-id=0x10de,vendor-id=0x10de,x-vga=1
ide2: none,media=cdrom
machine: pc-q35-5.2
memory: 30720
name: vPC05
net0: virtio=76:00:AB:86:9E:C1,bridge=vmbr1,tag=100
numa: 1
onboot: 1
ostype: win10
sata0: PVNAS1-Vm:104/vm-104-disk-1.qcow2,cache=writeback,discard=on,size=250G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=512effce-a07c-4b6d-9c5f-e75f63db4ddc
sockets: 2
vmgenid: 7b1fb554-8dc8-4e27-a2ea-ee3310b39c9d

pveversion -v
Code:
root@PVE001:~# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.13: 7.1-9
pve-kernel-5.4: 6.4-15
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.64-1-pve: 5.15.64-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.1-1
proxmox-backup-file-restore: 2.3.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.5-6
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

After a reboot of the whole host, the VMs work again until they are shut down.

Only 9 instances of nvidia-267 are assigned, while up to 12 should be available; nevertheless, the remaining 3 VMs can no longer be started. Here is the relevant output from mdevctl and nvidia-smi:

Code:
root@PVE001:~# qm start 104
pci device '0000:3b:00.0' has no available instances of 'nvidia-267'
root@PVE001:~# mdevctl list
00000000-0000-0000-0000-000000000100 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000101 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000102 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000103 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000107 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000108 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000109 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000110 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000111 0000:3b:00.0 nvidia-267 manual
00000000-0000-0000-0000-000000000112 0000:af:00.0 nvidia-268 manual
00000000-0000-0000-0000-000000000113 0000:af:00.0 nvidia-268 manual
00000000-0000-0000-0000-000000000114 0000:af:00.0 nvidia-268 manual
00000000-0000-0000-0000-000000000115 0000:af:00.0 nvidia-268 manual
00000000-0000-0000-0000-000000000116 0000:af:00.0 nvidia-268 manual
00000000-0000-0000-0000-000000000117 0000:af:00.0 nvidia-268 manual
00000000-0000-0000-0000-000000000118 0000:af:00.0 nvidia-268 manual
00000000-0000-0000-0000-000000000119 0000:af:00.0 nvidia-268 manual

root@PVE001:~# mdevctl types
0000:3b:00.0
  nvidia-264
    Available instances: 0
    Device API: vfio-pci
    Name: GRID RTX8000-1Q
    Description: num_heads=4, frl_config=60, framebuffer=1024M, max_resolution=5120x2880, max_instance=32
  nvidia-265
    Available instances: 0
    Device API: vfio-pci
    Name: GRID RTX8000-2Q
    Description: num_heads=4, frl_config=60, framebuffer=2048M, max_resolution=7680x4320, max_instance=24
  nvidia-266
    Available instances: 0
    Device API: vfio-pci
    Name: GRID RTX8000-3Q
    Description: num_heads=4, frl_config=60, framebuffer=3072M, max_resolution=7680x4320, max_instance=16
  nvidia-267
    Available instances: 0
    Device API: vfio-pci
    Name: GRID RTX8000-4Q
    Description: num_heads=4, frl_config=60, framebuffer=4096M, max_resolution=7680x4320, max_instance=12
    
root@PVE001:~# nvidia-smi
Thu Dec 22 07:05:50 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.12    Driver Version: 525.60.12    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:3B:00.0 Off |                  Off |
| 33%   38C    P8    28W / 260W |  44928MiB / 49152MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:AF:00.0 Off |                  Off |
| 33%   34C    P8    35W / 260W |  49000MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2241    C+G   vgpu                             4956MiB |
|    0   N/A  N/A      2312    C+G   vgpu                             4125MiB |
|    0   N/A  N/A      2431    C+G   vgpu                             4345MiB |
|    0   N/A  N/A      2726    C+G   vgpu                             4184MiB |
|    0   N/A  N/A      3321    C+G   vgpu                             4125MiB |
|    0   N/A  N/A      5689    C+G   vgpu                             4126MiB |
|    0   N/A  N/A      6666    C+G   vgpu                             4126MiB |
|    0   N/A  N/A      7501    C+G   vgpu                             4126MiB |
|    0   N/A  N/A      8556    C+G   vgpu                             4126MiB |
|    0   N/A  N/A      9117    C+G   vgpu                             4126MiB |
|    0   N/A  N/A     10201    C+G   vgpu                             4126MiB |
|    1   N/A  N/A     11292    C+G   vgpu                             6156MiB |
|    1   N/A  N/A     12339    C+G   vgpu                             6156MiB |
|    1   N/A  N/A     13356    C+G   vgpu                             6156MiB |
|    1   N/A  N/A     14488    C+G   vgpu                             6156MiB |
|    1   N/A  N/A     15879    C+G   vgpu                             6380MiB |
|    1   N/A  N/A     16823    C+G   vgpu                             6156MiB |
|    1   N/A  N/A     17945    C+G   vgpu                             6156MiB |
|    1   N/A  N/A     19448    C+G   vgpu                             6156MiB |
+-----------------------------------------------------------------------------+
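(Side note, not captured above: on vGPU host drivers, nvidia-smi also has a vgpu subcommand that lists the active vGPU instances per physical GPU, which can be used to cross-check the counts from mdevctl list.)

Code:
# Lists the active vGPU instances per physical GPU (vGPU host drivers only)
nvidia-smi vgpu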

Does anyone have the same problem or a solution for it?

Thank you and Merry Christmas!
 
Hi

I have the same problem. Has anyone tried another kernel, such as the one proposed in the Bugzilla thread?

Thanks!
 
Dear both

Thank you for the feedback. I decided to go back to driver branch 14 (14.4; host: 510.108.03, guest: 514.08). With kernel 5.15.74-1-pve everything is working again.

Best regards
 
Hi all, I don't know if this has already been resolved somewhere else, so I thought I would take five minutes to share my workaround to avoid reverting to older drivers.

It seems that, for some reason, the vGPU (mdev) device is not properly released when a VM is shut down.

If you use:
mdevctl list

you will get something like:
00000000-0000-0000-0000-000000000115 0000:84:00.0 nvidia-61

then to release it you just need to do:
mdevctl stop -u 00000000-0000-0000-0000-000000000115

and ta-da, you can restart the VM.

The end of the UUID seems to be the VM ID, so I guess this could be added to the VM's shutdown hook to force the release.
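Here is a minimal hookscript sketch of that idea (untested; it assumes the UUID pattern seen in this thread, i.e. zeros followed by the VM ID for hostpci0, and that the script is registered with something like "qm set 104 --hookscript local:snippets/release-mdev.sh"):

Code:
#!/bin/bash
# release-mdev.sh - Proxmox hookscript sketch: stop a leftover mdev after VM shutdown.
# Proxmox calls hookscripts with two arguments: the VM ID and the phase.
vmid="$1"
phase="$2"

if [ "$phase" = "post-stop" ]; then
    # Assumed UUID pattern for hostpci0, as seen in the mdevctl list output above
    uuid=$(printf '00000000-0000-0000-0000-%012d' "$vmid")
    # Only stop the mdev if it is still active
    if mdevctl list | grep -q "^$uuid "; then
        mdevctl stop -u "$uuid"
    fi
fi

exit 0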

Hope it can help others.
 
Thanks noel, my workaround actually worked once, but it wasn't a proper solution and I ran into multiple issues after that.

I'm on 5.15.83-1-pve

In many cases no device was listed by "mdevctl list", yet one was still present and could not be removed, making a reboot mandatory.

I ended up applying the patch you shared after reading other discussions on the forum, and so far everything is fine.

Thanks!
 