Background
- Supermicro server with two RTX 8000 cards
- NVIDIA GRID KVM driver installed and managing the cards
- nvidia-smi reports that both cards are good to go
- cards are at 0000:01:00.0 and 0000:41:00.0
When I boot the system, open a VM, and add a PCI Device under Hardware, I do the following (the config line this produces is sketched after the list):
- pick Device 0000:01:00.0
- pick an nvidia-xxx entry from the MDev Type list, in my case one of the 4xx profiles
-- at first the list shows every nvidia-xxx profile with available units, e.g. 402 has 32, 403 has 24, 404 has 32, etc.
- All Functions is greyed out because these cards don't have display heads or audio subfunctions
- leave Primary GPU unchecked, since I'm not using the cards for video output
- PCI-Express checked (I have also tried it unchecked)
- ROM-Bar left at its default, checked.
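My understanding is that this dialog ends up writing a hostpci entry into /etc/pve/qemu-server/108.conf roughly like the one below (nvidia-405 is just an example profile, and I'm not certain exactly which options the GUI includes by default):

hostpci0: 0000:01:00.0,mdev=nvidia-405,pcie=1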
If I then do a qm start 108 (108 is the VM I was testing with), I get these errors:
root@mtvmserver:~# qm start 108
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio 00000000-0000-0000-0000-000000000108: error getting device from group 153: Connection timed out
Verify all devices in group 153 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1
Or, if I change to the other card, I get this:
root@mtvmserver:~# qm start 108
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:41:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=pci.0,addr=0x10: vfio /sys/bus/pci/devices/0000:41:00.0/00000000-0000-0000-0000-000000000108: no such host device: No such file or directory
start failed: QEMU exited with code 1
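Is looking at the mdev bus the right way to check whether the instance exists and which VFIO group it got? Something like this (a sketch, using the UUID that qm generates for VM 108):

# list all mediated devices currently created on the host
ls -l /sys/bus/mdev/devices/
# if the instance exists, show which IOMMU/VFIO group it was placed in
readlink /sys/bus/mdev/devices/00000000-0000-0000-0000-000000000108/iommu_group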
If I edit /etc/pve/qemu-server/108.conf, add
args: -uuid 00000000-0000-0000-0000-000000000108
and then switch back to card 01, I get:
root@mtvmserver:~# qm start 108
mdev instance '00000000-0000-0000-0000-000000000108' already existed, using it.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 00000000-0000-0000-0000-000000000108: Could not enable error recovery for the device
Why do I need the args: -uuid line to force the creation of the mdev directories?
root@mtvmserver:~# ls -la /sys/bus/pci/devices/0000\:01\:00.0/
total 0
drwxr-xr-x 10 root root 0 Mar 19 11:15 .
drwxr-xr-x 9 root root 0 Mar 19 11:15 ..
drwxr-xr-x 4 root root 0 Mar 19 11:24 00000000-0000-0000-0000-000000000102
drwxr-xr-x 4 root root 0 Mar 19 11:22 00000000-0000-0000-0000-000000000108
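My understanding is that, outside of qm, an mdev instance is normally created by writing a UUID to the profile's create node, roughly like this (nvidia-405 as an example profile; the path assumes the GRID driver has registered mdev_supported_types):

# create an mdev instance of profile nvidia-405 on card 01 by hand
echo "00000000-0000-0000-0000-000000000108" > /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-405/create

So I expected qm start to take care of that on its own, without the extra args line.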
Why does the mdev list now show "0" available for every profile except the one I picked initially, in my case nvidia-405, which now shows a count of 21?
This seems to happen after I start a VM at least once: after that I cannot pick a different mdev profile on the same card, and other VMs hit the same issue. If I then switch another VM to card 41, I again see plenty of profiles to choose from, but not on card 01.
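If it helps, I believe the remaining capacity per profile can also be read straight from sysfs, something like this (a sketch):

# print the available_instances count for each vGPU profile on card 01
for t in /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-*; do
    printf '%s: %s\n' "$(basename "$t")" "$(cat "$t/available_instances")"
done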
I have attached some screenshots showing the dialog box in question (PCI Devices).