Problems with mdev instance creation

wmerkens

Member
Mar 19, 2021
Background
- Supermicro server with two RTX 8000 cards
- NVIDIA GRID KVM driver installed, managing the cards
- nvidia-smi says all cards are good to go
- cards are in slots 0000:01:00.0 and 0000:41:00.0

After booting the system, I go to a VM and, under Hardware, add a PCI Device:
- pick Device 0000:01:00.0
- pick an nvidia-xxx entry from the mdev list, in my case 4xx
-- at first the list shows every nvidia-xxx type with its available units, e.g. 402 has 32, 403 has 24, 404 has 32, etc.
- All Functions is greyed out because these cards don't have heads or audio subfunctions
- leave Primary GPU unchecked, since I'm not using the cards for video output
- PCI-Express checked (I have also tried it unchecked)
- ROM-Bar defaults to checked

If I run qm start 108 (108 is the VM I was testing with),

I get these errors:

root@mtvmserver:~# qm start 108
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio 00000000-0000-0000-0000-000000000108: error getting device from group 153: Connection timed out
Verify all devices in group 153 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1

Or, if I switch to the other card, I get this:

root@mtvmserver:~# qm start 108
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:41:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=pci.0,addr=0x10: vfio /sys/bus/pci/devices/0000:41:00.0/00000000-0000-0000-0000-000000000108: no such host device: No such file or directory
start failed: QEMU exited with code 1

If I edit /etc/pve/qemu-server/108.conf and add

args: -uuid 00000000-0000-0000-0000-000000000108

and then switch back to card 01, I get:

root@mtvmserver:~# qm start 108
mdev instance '00000000-0000-0000-0000-000000000108' already existed, using it.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 00000000-0000-0000-0000-000000000108: Could not enable error recovery for the device

Why do I need to use args: -uuid to force the creation of the directories?

root@mtvmserver:~# ls -la /sys/bus/pci/devices/0000\:01\:00.0/
total 0
drwxr-xr-x 10 root root 0 Mar 19 11:15 .
drwxr-xr-x 9 root root 0 Mar 19 11:15 ..
drwxr-xr-x 4 root root 0 Mar 19 11:24 00000000-0000-0000-0000-000000000102
drwxr-xr-x 4 root root 0 Mar 19 11:22 00000000-0000-0000-0000-000000000108

Why does the mdev list now show "0" for everything except the type I picked initially (in my case nvidia-405, whose count is now 21)?

This seems to occur once I have started a VM at least once. After that I cannot pick a different mdev type on the same card, and other VMs hit the same issue. If I then switch another VM to card 41, I again see plenty to choose from, but not on card 01.
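
For anyone debugging the same thing: the counts in the mdev dropdown should correspond to the available_instances files that the kernel mdev framework exposes under the card's mdev_supported_types directory in sysfs. Below is a minimal sketch for printing them from the shell. It assumes that layout is present on the host, and the function name list_mdev_types is made up for illustration.

```shell
#!/bin/sh
# Sketch (assumption: the card exposes mdev_supported_types in sysfs).
# Prints "<type> <available_instances>" for every mdev type of one device.
# list_mdev_types is a hypothetical helper name, not a Proxmox or NVIDIA tool.
list_mdev_types() {
    dev_dir=$1
    for type_dir in "$dev_dir"/mdev_supported_types/*; do
        # Skip anything that is not a type directory with a readable counter
        [ -r "$type_dir/available_instances" ] || continue
        printf '%s %s\n' "$(basename "$type_dir")" \
            "$(cat "$type_dir/available_instances")"
    done
}
```

Running list_mdev_types /sys/bus/pci/devices/0000:01:00.0 before and after the first VM start should show the same change the dialog shows, without going through the GUI.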

I have attached some screenshots to show the dialog box in question, PCI Devices



Attachments: proxmos-1.png, proxmox-2.png, proxmox-3.png

How did you solve this problem?
It turned out to be a lack of knowledge on my part about how mdev works. When you assign an Nvidia grid license to the first card on the first VM, it sets the license choice to that one for every card assigned to subsequent VMs.

Those numbers you see correspond to a license type, which Nvidia documents. Also, the more vGPUs you assign to a card, the smaller the frame buffer available to each vGPU. This caught us out when using the NVENC/NVDEC of the vGPU.

Once I understood the mapping and what the limits are, it was easy: simply remove all cards from all VMs to reset things, then pick a different license, for example changing from an 8 vGPU per card license to a 24 vGPU per card license (nvidia-405 to nvidia-402, I believe).
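
To see which instances currently exist on a card, and which type each one uses, something like the following can help before resetting. It is only a sketch: it assumes each instance directory under the card (like the 00000000-... directories in the ls output above) carries the mdev_type symlink that the kernel mdev framework provides, and list_mdev_instances is a made-up name.

```shell
#!/bin/sh
# Sketch (assumption: every mdev instance directory has an mdev_type symlink
# pointing at its type directory, as in the kernel mdev sysfs layout).
# Prints "<uuid> <type>" for every instance found under one PCI device.
# list_mdev_instances is a hypothetical helper name.
list_mdev_instances() {
    dev_dir=$1
    # UUID-shaped directory names have five dash-separated groups
    for inst in "$dev_dir"/*-*-*-*-*; do
        [ -L "$inst/mdev_type" ] || continue
        printf '%s %s\n' "$(basename "$inst")" \
            "$(basename "$(readlink "$inst/mdev_type")")"
    done
}
```

For example, list_mdev_instances /sys/bus/pci/devices/0000:01:00.0 would show which VMs (by the VMID in the UUID) still hold an instance on that card.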

One problem you can see in the screenshots is that the pick box is too narrow; you really need to see the hidden columns. Proxmox needs to fix that.

As for the startup problem, Proxmox needs to add
Code:
args: -uuid 00000000-0000-0000-0000-000000000108
lines to the confs of these VMs when they are grid-driver based. I do that manually at the moment.
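
In every example in this thread the UUID is just the VMID zero-padded into the last group (108 becomes ...000000000108). Here is a small sketch for generating the line instead of typing it by hand. It assumes the first groups stay all zeros, as in the examples here, and the helper names vmid_to_uuid and args_line are made up.

```shell
#!/bin/sh
# Sketch: build the UUID pattern seen in this thread from a VMID.
# Assumption: the leading groups are all zeros, as in every example here.
# vmid_to_uuid and args_line are hypothetical helper names.
vmid_to_uuid() {
    # %012d zero-pads the VMID so it fills the last 12-digit group
    printf '00000000-0000-0000-0000-%012d\n' "$1"
}

# Print the line to add to /etc/pve/qemu-server/<vmid>.conf
args_line() {
    printf 'args: -uuid %s\n' "$(vmid_to_uuid "$1")"
}

args_line 108
# prints: args: -uuid 00000000-0000-0000-0000-000000000108
```

The output still has to be added to the VM's conf file by hand; the sketch deliberately does not touch /etc/pve.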
 
Yes, it works for me too.


Thanks for the info. If it's not too much trouble, I have a couple more questions.

About the driver:
1. On Proxmox, do I have to use NVIDIA-Linux-x86_64-460.73.02-vgpu-kvm.run?

2. In a Win10 VM, can I use the driver from https://www.nvidia.com/Download/index.aspx
(466.11-quadro-rtx-desktop-notebook-win10-64bit-international-dch-whql.exe)?

Or the one in the archive (462.31_grid_win10_server2016_server2019_64bit_international.exe)?

About Proxmox:
Were any other settings changed in Proxmox
(besides editing /etc/kernel/cmdline to add intel_iommu=on)?
 

About the driver:
1. On Proxmox, do I have to use NVIDIA-Linux-x86_64-460.73.02-vgpu-kvm.run?

The install from Nvidia comes in three parts: the host driver, which is that one; the driver that goes into the VM, NVIDIA-Linux-x86_64-460.73.01-grid.run or its Windows equivalent; and the license server, which can simply run in a VM.

2. In a Win10 VM, can I use the driver from https://www.nvidia.com/Download/index.aspx
(466.11-quadro-rtx-desktop-notebook-win10-64bit-international-dch-whql.exe)?

No, I don't think so. You have to use the grid version; otherwise you will probably get the same error that Linux shows when you use a non-grid driver in the VM, namely that the wrong card or driver was detected.

Or the one in the archive (462.31_grid_win10_server2016_server2019_64bit_international.exe)?

Normally when you use grid, you download from the license/grid portal page:
NVIDIA-GRID-Linux-KVM-460.73.02-460.73.01-462.31.zip

This contains all the parts plus docs. It also contains
462.31_grid_server2012R2_64bit_international.exe
462.31_grid_win10_server2016_server2019_64bit_international.exe

I believe you would use the second one for a Win10 VM.


About Proxmox:
Were any other settings changed in Proxmox
(besides editing /etc/kernel/cmdline to add intel_iommu=on)?

Yes, make sure nouveau is not loaded or installed.
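
A quick way to check the nouveau part from the shell is to look for the module in /proc/modules. The sketch below does only that; the function name nouveau_loaded is made up, and the optional file argument exists purely so the check can be tried against a sample file.

```shell
#!/bin/sh
# Sketch: report whether the nouveau kernel module is currently loaded.
# nouveau_loaded is a hypothetical helper name. It reads /proc/modules by
# default; each line there starts with the module name followed by a space.
nouveau_loaded() {
    modules_file=${1:-/proc/modules}
    grep -q '^nouveau ' "$modules_file" 2>/dev/null
}

if nouveau_loaded; then
    echo "nouveau is loaded"
else
    echo "nouveau is not loaded"
fi
```

If it is loaded, the usual approach is a "blacklist nouveau" line in a file under /etc/modprobe.d, followed by rebuilding the initramfs and rebooting; treat that as general Linux practice rather than something confirmed in this thread.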

If you plan to use the card as a passthrough device, you will need a vfio-pci.conf file in /etc/modprobe.d. For example, to pass through an RTX 6000:

options vfio-pci ids=10de:1e30,10de:10f7,10de:1ad6,10de:1ad7

In /etc/default/grub I used
amd_iommu=on iommu=pt

Hope that helps.
 
Thanks for the detailed answers.

As for the startup problem, Proxmox needs to add
Code:
args: -uuid 00000000-0000-0000-0000-000000000108
lines to the confs of these VMs when they are grid-driver based. I do that manually at the moment.
Do you think the Proxmox developers can fix this?


One problem you can see in the screenshots is that the pick box is too narrow; you really need to see the hidden columns. Proxmox needs to fix that.
Posted in the Releases topic.

________


It seems I now have everything working (only with pve-kernel-5.4.xxx-pve):

Attachments: pci_mdev_001.png, nvidia_vgpu_kvm_001.png, nvidia_vgpu_kvm_002.resized.png
Hi @wmerkens
I have the same problem here; on Red Hat it works fine with a Quadro RTX 8000.
Which version of Proxmox do you use? I will try with the same version.
I already added the uuid to the file, but the VM doesn't boot.

Edit: Which version of the Nvidia vGPU driver do you use?
 

I'm not wmerkens, but I use the latest updates:

Proxmox 6.4-8:
pve-kernel-5.4.119-1-pve
pve-headers-5.4.119-1-pve
apt install build-essential gcc-multilib dkms

NVIDIA-Linux-x86_64-460.32.04-vgpu-kvm.run

VM:
qm config 825:
Code:
agent: 1
args: -uuid 00000000-0000-0000-0000-000000000825
bios: ovmf
boot: order=scsi0
cores: 8
efidisk0: ssdlvm:vm-825-disk-1,size=128K
hostpci0: 0000:18:00.0,mdev=nvidia-444,pcie=1,x-vga=1
machine: pc-q35-5.2
memory: 16384
name: w10showdesktop
net0: virtio=E6:D0:71:79:B6:4A,bridge=vmbr0,firewall=1
numa: 1
ostype: win10
scsi0: ssdlvm:vm-825-disk-0,cache=writeback,discard=on,size=100G
scsihw: virtio-scsi-pci
smbios1: uuid=e7cb7505-4cb4-482e-9111-4f36fc23926c
sockets: 1
vga: virtio
vmgenid: d58d4539-b7d4-4a1b-94fb-d4244e167416


What is the output of qm start XXX?
 
Hi @mishki

Only the versions of Proxmox and the Nvidia driver are the same as yours. I will update the kernel and try again.

The output of qm start vmid is:

root@bkloud-lab:~# qm start 100
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:3b:00.0/00000000-0000-0000-0000-000000000100,id=hostpci0,bus=pci.0,addr=0x10,rombar=0: vfio 00000000-0000-0000-0000-000000000100: error getting device from group 62: Connection timed out
Verify all devices in group 62 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1


I found this in syslog:

Jun 14 20:46:14 bkloud-lab nvidia-vgpu-mgr[3337]: error: vmiop_env_log: Failed to get VM UUID from QEMU command-line 0x57
Jun 14 20:46:14 bkloud-lab nvidia-vgpu-mgr[3337]: error: vmiop_env_log: kvm_plugin_global_init failed with error 0x57
Jun 14 20:46:25 bkloud-lab kernel: [ 584.276249] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000101: start failed. status: 0x65 Timeout Occured

The vm config:

root@bkloud-lab:~# qm config 100
args: --uuid 00000000-0000-0000-0000-000000000100
bios: ovmf
bootdisk: ide0
cores: 8
cpu: host
efidisk0: local-lvm:vm-100-disk-1,size=4M
hostpci0: 3b:00.0,mdev=nvidia-264,rombar=0
ide0: local-lvm:vm-100-disk-0,cache=writeback,size=61G
ide2: local:iso/Windows10__2020_.iso,media=cdrom
memory: 16384
name: Windows10
net0: e1000=E2:15:D5:FE:05:85,bridge=vmbr0
numa: 0
ostype: win10
smbios1: uuid=b871a7b1-baa3-475c-af2e-567e69012d2f
sockets: 1
vmgenid: cb4f22c4-65d2-4a0c-a4fc-ada3b98f1042
 
Well...it works.

When I run qm start vmid, this warning is displayed, but the VM starts anyway.

kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:3b:00.0/00000000-0000-0000-0000-000000000103,id=hostpci0,bus=ich9-pcie-port-fio 00000000-0000-0000-0000-000000000103: Could not enable error recovery for the device




Thank you so much.
 
I get those too. Even when the server has that feature enabled in the BIOS, it does not seem to work.
 
How do you get/assign Nvidia grid licenses?
 
An Nvidia License Server must be installed somewhere, and licenses must be purchased depending on the configuration.

But there is another way: drivers and unlock

(Nvidia deleted all our accounts and purchased licenses and access to the portal, respectively)
 