Hi, I'm currently battling with NVIDIA GPU drivers. Even though I was able to pass the GPU through successfully, no NVIDIA driver will load in the VM. I'm at a dead end right now.
Host:
- PVE version: 7.2-4
- Linux: Debian 11 (5.15.35-2-pve)
- GPU: NVIDIA Tesla V100 PCIe 32GB
- CPU: Intel
- VT-d: enabled
- Above 4G Decoding: enabled
- Kernel args:
quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off
- modprobe modules:
  - blacklist.conf:
    Code:
    blacklist nouveau
    blacklist nvidia
  - iommu_unsafe_interrupts.conf:
    Code:
    options vfio_iommu_type1 allow_unsafe_interrupts=1
  - kvm.conf:
    Code:
    options kvm ignore_msrs=1
  - mdadm.conf:
    Code:
    options md_mod start_ro=1
  - pve-blacklist.conf:
    Code:
    blacklist nvidiafb
- /etc/modules:
Code:
usbhid
psmouse
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
- lspci -v (NVIDIA part only):
Code:
5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
    Flags: fast devsel, IRQ 690, NUMA node 0, IOMMU group 93
    Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 3bf000000000 (64-bit, prefetchable) [size=32G]
    Memory at 3bf800000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Capabilities: [ac0] Designated Vendor-Specific: Vendor=10de ID=0001 Rev=1 Len=12 <?>
    Kernel driver in use: vfio-pci
    Kernel modules: nvidiafb, nouveau
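For anyone who wants to check the same binding from a script, here is a minimal sketch. The helper name `driver_in_use` is made up for illustration; it only parses `lspci -k` or `lspci -v` text from stdin and assumes the usual "Kernel driver in use:" line format shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): print the driver named on
# the "Kernel driver in use:" line of lspci output read from stdin.
driver_in_use() {
    sed -n 's/.*Kernel driver in use: *\([^ ]*\).*/\1/p'
}

# Example on the host, using the GPU address from the listing above:
#   lspci -k -s 5e:00.0 | driver_in_use
# For passthrough the host should report vfio-pci here, not nouveau or nvidia.
```

If the function prints nothing, no driver is bound to the device at all.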
VM:
- Linux: Debian 11 (5.10.0-19-amd64)
- nvidia drivers: 520.61.05-1 amd64
- apt sources:
Code:
http://deb.debian.org/debian bullseye
http://security.debian.org/debian-security bullseye-security
http://deb.debian.org/debian bullseye-updates
https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64
- nvcc --version:
cuda_11.8.r11.8/compiler.31833905_0
- gcc --version:
10.2.1 20210110
- RAM: 16 GB
- Bios: OVMF (UEFI)
- Machine: q35
- PCI device (gpu): pcie=1
- kernel args:
quiet pci=realloc
- secure boot: disabled
- lspci -v (NVIDIA part only):
Code:
01:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
    Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
    Physical Slot: 0
    Flags: fast devsel, IRQ 16
    Memory at ff000000 (32-bit, non-prefetchable) [size=16M]
    Memory at <ignored> (64-bit, prefetchable)
    Memory at <ignored> (64-bit, prefetchable)
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Kernel modules: nvidia
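Note that in the VM the two 64-bit prefetchable regions show up as `<ignored>`, while on the host they are 32G and 32M. A minimal sketch for counting such regions from a script follows; the function name `count_ignored_bars` is hypothetical, and it assumes `lspci -v` prints one "Memory at" entry per line.

```shell
#!/bin/sh
# Hypothetical helper: count memory regions that lspci reports as
# "<ignored>", i.e. BARs that did not get an address inside the guest.
# Reads `lspci -v` style text on stdin and prints the count.
count_ignored_bars() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'Memory at <ignored>' || true
}

# Example inside the VM, using the GPU address from the listing above:
#   lspci -v -s 01:00.0 | count_ignored_bars
```

On the VM output above this would report 2; a result of 0 would mean every memory region got an address.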
Issue: the VM with the GPU fails to load the NVIDIA driver during boot:
Code:
Nov 20 23:13:00 gputest kernel: nvidia: loading out-of-tree module taints kernel.
Nov 20 23:13:00 gputest kernel: nvidia: module license 'NVIDIA' taints kernel.
Nov 20 23:13:00 gputest kernel: Disabling lock debugging due to kernel taint
...
Nov 20 23:13:00 gputest kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 20 23:13:00 gputest kernel: nvidia 0000:01:00.0: enabling device (0140 -> 0142)
Nov 20 23:13:00 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:00 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:00 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:00 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
...
Nov 20 23:13:00 gputest systemd-modules-load[351]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 20 23:13:00 gputest systemd-modules-load[349]: modprobe: ERROR: ../libkmod/libkmod-module.c:990 command_do() Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
Nov 20 23:13:00 gputest systemd-modules-load[349]: modprobe: ERROR: could not insert 'nvidia': Invalid argument
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 20 23:13:00 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:00 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:00 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:00 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
Nov 20 23:13:00 gputest systemd-udevd[415]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 20 23:13:00 gputest systemd-udevd[405]: Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
... (repeats couple of times)
Nov 20 23:13:01 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Nov 20 23:13:01 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:01 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:01 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:01 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:01 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 243
Nov 20 23:13:01 gputest nvidia-persistenced[610]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 106 has read and write permissions for those files.
Nov 20 23:13:01 gputest nvidia-persistenced[610]: Shutdown (610)
Nov 20 23:13:01 gputest nvidia-persistenced[590]: nvidia-persistenced failed to initialize. Check syslog for more details.
Nov 20 23:13:01 gputest systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
Nov 20 23:13:01 gputest systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Nov 20 23:13:01 gputest systemd[1]: Failed to start NVIDIA Persistence Daemon.
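The relevant part of the log above is the repeated NVRM complaint about BAR1. To pull just that out of a long kernel log, something like the following sketch may help. The function name `nvrm_bar_errors` is made up, and the pattern is based only on the message format seen in this log, so other driver versions may word it differently.

```shell
#!/bin/sh
# Hypothetical helper: extract the distinct "BARn is 0M @ 0x0" complaints
# printed by the NVIDIA kernel module from log text on stdin.
nvrm_bar_errors() {
    grep -oE 'BAR[0-9]+ is 0M @ 0x0 \(PCI:[0-9a-fA-F:.]+\)' | sort -u
}

# Example inside the VM (kernel messages of the current boot):
#   journalctl -k -b | nvrm_bar_errors
```

Because of `sort -u`, each affected BAR and PCI address is listed once, however many times the probe was retried.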
How it was installed (from official nvidia site):
Code:
wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda
I've tried installing from their run script; it failed with: Unable to load the kernel module 'nvidia.ko'.
I've tried installing drivers from another run script, but it failed with the same error.
I've tried creating a VM from the Pop!_OS ISO with preinstalled NVIDIA drivers; no luck.
I've tried installing an older version on Ubuntu Server zesty with CUDA 9.2; same thing.
Anything I can do?
Maybe I forgot something in the host configuration, maybe I didn't enable something in the VM, or maybe my card isn't supported on Linux at all?
I don't want to install NVIDIA drivers, CUDA, etc. on the host machine because NVIDIA's packages are a mess. It's safer and easier (yeah, not really) to use a VM for this, right?