Hi there,
We have been pulling our hair out (what is left of it) because we cannot get our NVIDIA A30 GPUs to work since moving to the latest kernel (6.8.12-1-pve). We have followed all the steps and documentation and installed the latest drivers. On our hosts we can see the GPUs, but as soon as we add a GPU to a VM via the CLI (Ubuntu VMs only), we get this:
kvm: -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:44:00.4: vfio 0000:44:00.4: group 28 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.
TASK ERROR: start failed: QEMU exited with code 1
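Since the error refers to the whole IOMMU group, the group's members and the driver each one is bound to can be listed from sysfs. This is only a minimal sketch, assuming the standard /sys/bus/pci/devices/<address>/iommu_group/devices layout; the function name list_iommu_group and the SYSFS_ROOT override are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Print every PCI device that shares an IOMMU group with the given device,
# followed by the kernel driver it is currently bound to.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) so the function
# can be pointed at a copy of the tree.
list_iommu_group() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    group_dir="$root/bus/pci/devices/$dev/iommu_group/devices"
    if [ ! -d "$group_dir" ]; then
        echo "no iommu_group found for $dev" >&2
        return 1
    fi
    for d in "$group_dir"/*; do
        [ -e "$d" ] || continue
        name=$(basename "$d")
        # The "driver" entry is a symlink to the bound driver; it is absent
        # when no driver is bound to the device.
        if [ -L "$d/driver" ]; then
            drv=$(basename "$(readlink "$d/driver")")
        else
            drv="(none)"
        fi
        printf '%s %s\n' "$name" "$drv"
    done
}
```

Called as list_iommu_group 0000:44:00.4, it prints one "address driver" pair per line for every device in that group.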
We even tried changing "-device vfio-pci" to "-device nvidia" and got this:
kvm: -device nvidia,sysfsdev=/sys/bus/pci/devices/0000:44:00.4: 'nvidia' is not a valid device model name
TASK ERROR: start failed: QEMU exited with code 1
Running "lspci -d 10de: -k", we get this:
44:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.4 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.5 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.6 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:00.7 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.1 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.2 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
44:01.3 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.4 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.5 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.6 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:00.7 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.0 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.1 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.2 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
c4:01.3 3D controller: NVIDIA Corporation GA100GL [A30 PCIe] (rev a1)
Subsystem: NVIDIA Corporation GA100GL [A30 PCIe]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_vgpu_vfio, nvidia
Running nvidia-smi, we see this:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     On  |   00000000:44:00.0 Off |                  Off |
| N/A   30C    P0             29W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A30                     On  |   00000000:C4:00.0 Off |                    0 |
| N/A   31C    P0             32W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
So the GPUs are clearly there.
sriov-manage also reports that all is good:
GPU at 0000:44:00.0 already has VFs enabled.
GPU at 0000:c4:00.0 already has VFs enabled.
In /etc/modprobe.d/blacklist.conf we have the following set:
blacklist nouveau
blacklist nvidia
Where are we going wrong?
Any tips or advice would be awesome.
Thanks in advance.