Background
- Supermicro server with two RTX 8000 cards
- NVIDIA GRID KVM driver installed and managing the cards
- nvidia-smi reports that both cards are good to go
- cards are at 0000:01:00.0 and 0000:41:00.0
When I boot the system, open a VM, and add a PCI Device under Hardware, I do the following (the config line this produces is sketched after the list):
- pick Device 0000:01:00.0
- pick an nvidia-xxx entry from the MDev Type list, in my case one of the 4xx profiles
-- at first the list shows every nvidia-xxx profile with available units, e.g. 402 has 32, 403 has 24, 404 has 32, etc.
- All Functions is greyed out because these cards don't have display heads or audio subfunctions
- leave Primary GPU unchecked, since I'm not using the cards for video output
- PCI-Express checked (I have also tried it unchecked)
- ROM-Bar left at its default, checked.
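My understanding is that this dialog ends up writing a hostpci entry into /etc/pve/qemu-server/108.conf roughly like the one below (nvidia-405 is just an example profile, and I'm not certain exactly which options the GUI includes by default):

hostpci0: 0000:01:00.0,mdev=nvidia-405,pcie=1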
If I then do a qm start 108 (108 is the VM I was testing with), I get these errors:
root@mtvmserver:~# qm start 108
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio 00000000-0000-0000-0000-000000000108: error getting device from group 153: Connection timed out
Verify all devices in group 153 are bound to vfio-<bus> or pci-stub and not already in use
start failed: QEMU exited with code 1
Or, if I change to the other card, I get this:
root@mtvmserver:~# qm start 108
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:41:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=pci.0,addr=0x10: vfio /sys/bus/pci/devices/0000:41:00.0/00000000-0000-0000-0000-000000000108: no such host device: No such file or directory
start failed: QEMU exited with code 1
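Is looking at the mdev bus the right way to check whether the instance exists and which VFIO group it got? Something like this (a sketch, using the UUID that qm generates for VM 108):

# list all mediated devices currently created on the host
ls -l /sys/bus/mdev/devices/
# if the instance exists, show which IOMMU/VFIO group it was placed in
readlink /sys/bus/mdev/devices/00000000-0000-0000-0000-000000000108/iommu_group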
If I edit /etc/pve/qemu-server/108.conf, add
args: -uuid 00000000-0000-0000-0000-000000000108
and then switch back to card 01, I get:
root@mtvmserver:~# qm start 108
mdev instance '00000000-0000-0000-0000-000000000108' already existed, using it.
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:01:00.0/00000000-0000-0000-0000-000000000108,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: warning: vfio 00000000-0000-0000-0000-000000000108: Could not enable error recovery for the device
Why do I need the args: -uuid line to force the creation of the mdev directories?
root@mtvmserver:~# ls -la /sys/bus/pci/devices/0000\:01\:00.0/
total 0
drwxr-xr-x 10 root root 0 Mar 19 11:15 .
drwxr-xr-x 9 root root 0 Mar 19 11:15 ..
drwxr-xr-x 4 root root 0 Mar 19 11:24 00000000-0000-0000-0000-000000000102
drwxr-xr-x 4 root root 0 Mar 19 11:22 00000000-0000-0000-0000-000000000108
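My understanding is that, outside of qm, an mdev instance is normally created by writing a UUID to the profile's create node, roughly like this (nvidia-405 as an example profile; the path assumes the GRID driver has registered mdev_supported_types):

# create an mdev instance of profile nvidia-405 on card 01 by hand
echo "00000000-0000-0000-0000-000000000108" > /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-405/create

So I expected qm start to take care of that on its own, without the extra args line.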
Why does the mdev list now show "0" available for every profile except the one I picked initially, in my case nvidia-405, which now shows a count of 21?
This seems to happen after I start a VM at least once: after that I cannot pick a different mdev profile on the same card, and other VMs hit the same issue. If I then switch another VM to card 41, I again see plenty of profiles to choose from, but not on card 01.
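If it helps, I believe the remaining capacity per profile can also be read straight from sysfs, something like this (a sketch):

# print the available_instances count for each vGPU profile on card 01
for t in /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-*; do
    printf '%s: %s\n' "$(basename "$t")" "$(cat "$t/available_instances")"
done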
I have attached some screenshots showing the dialog box in question (PCI Devices).