[SOLVED] NVIDIA drivers for Tesla V100 PCIe 32GB failing to load

demiler

Nov 20, 2022
Hi, I'm currently battling with NVIDIA GPU drivers. The issue: even though I was able to pass the GPU through successfully, no NVIDIA driver will load in the VM. I'm at a dead end right now.

Host:
  • PVE version: 7.2-4
  • Linux: Debian 11 (5.15.35-2-pve)
  • GPU: NVIDIA Tesla V100 PCIe 32GB
  • CPU: Intel
  • VT-d: enabled
  • Above 4G Decoding: enabled
  • Kernel args: quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off
  • modprobe modules:
    • blacklist.conf:
      Code:
      blacklist nouveau
      blacklist nvidia
    • iommu_unsafe_interrupts.conf:
      Code:
      options vfio_iommu_type1 allow_unsafe_interrupts=1
    • kvm.conf:
      Code:
      options kvm ignore_msrs=1
    • mdadm.conf:
      Code:
      options md_mod start_ro=1
    • pve-blacklist.conf:
      Code:
      blacklist nvidiafb
  • /etc/modules:
    Code:
    usbhid
    psmouse
    vfio
    vfio_iommu_type1
    vfio_pci
    vfio_virqfd
  • lspci -v (NVIDIA part only):
      Code:
      5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
              Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
              Flags: fast devsel, IRQ 690, NUMA node 0, IOMMU group 93
              Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
              Memory at 3bf000000000 (64-bit, prefetchable) [size=32G]
              Memory at 3bf800000000 (64-bit, prefetchable) [size=32M]
              Capabilities: [60] Power Management version 3
              Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
              Capabilities: [78] Express Endpoint, MSI 00
              Capabilities: [100] Virtual Channel
              Capabilities: [250] Latency Tolerance Reporting
              Capabilities: [258] L1 PM Substates
              Capabilities: [128] Power Budgeting <?>
              Capabilities: [420] Advanced Error Reporting
              Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
              Capabilities: [900] Secondary PCI Express
              Capabilities: [ac0] Designated Vendor-Specific: Vendor=10de ID=0001 Rev=1 Len=12 <?>
              Kernel driver in use: vfio-pci
              Kernel modules: nvidiafb, nouveau

VM:
  • Linux: Debian 11 (5.10.0-19-amd64)
  • nvidia drivers: 520.61.05-1 amd64
  • apt sources:
    Code:
    http://deb.debian.org/debian bullseye
    http://security.debian.org/debian-security bullseye-security
    http://deb.debian.org/debian bullseye-updates
    https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64
  • nvcc --version: cuda_11.8.r11.8/compiler.31833905_0
  • gcc --version: 10.2.1 20210110
  • RAM: 16 GB
  • BIOS: OVMF (UEFI)
  • Machine: q35
  • PCI device (gpu): pcie=1
  • kernel args: quiet pci=realloc
  • secure boot: disabled
  • lspci -v (NVIDIA part only):
      Code:
      01:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
              Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
              Physical Slot: 0
              Flags: fast devsel, IRQ 16
              Memory at ff000000 (32-bit, non-prefetchable) [size=16M]
              Memory at <ignored> (64-bit, prefetchable)
              Memory at <ignored> (64-bit, prefetchable)
              Capabilities: [60] Power Management version 3
              Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
              Capabilities: [78] Express Endpoint, MSI 00
              Capabilities: [100] Virtual Channel
              Capabilities: [250] Latency Tolerance Reporting
              Capabilities: [128] Power Budgeting <?>
              Capabilities: [420] Advanced Error Reporting
              Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
              Kernel modules: nvidia

Issue: the VM with the GPU fails to load the NVIDIA driver during boot:
Code:
Nov 20 23:13:00 gputest kernel: nvidia: loading out-of-tree module taints kernel.
Nov 20 23:13:00 gputest kernel: nvidia: module license 'NVIDIA' taints kernel.
Nov 20 23:13:00 gputest kernel: Disabling lock debugging due to kernel taint
...
Nov 20 23:13:00 gputest kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 20 23:13:00 gputest kernel: nvidia 0000:01:00.0: enabling device (0140 -> 0142)
Nov 20 23:13:00 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                                NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:00 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:00 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:00 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
...
Nov 20 23:13:00 gputest systemd-modules-load[351]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 20 23:13:00 gputest systemd-modules-load[349]: modprobe: ERROR: ../libkmod/libkmod-module.c:990 command_do() Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
Nov 20 23:13:00 gputest systemd-modules-load[349]: modprobe: ERROR: could not insert 'nvidia': Invalid argument
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 20 23:13:00 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                                NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:00 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:00 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:00 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
Nov 20 23:13:00 gputest systemd-udevd[415]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 20 23:13:00 gputest systemd-udevd[405]: Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
... (repeats a couple of times)
Nov 20 23:13:01 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Nov 20 23:13:01 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                                NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:01 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:01 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:01 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:01 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 243
Nov 20 23:13:01 gputest nvidia-persistenced[610]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 106 has read and write permissions for those files.
Nov 20 23:13:01 gputest nvidia-persistenced[610]: Shutdown (610)
Nov 20 23:13:01 gputest nvidia-persistenced[590]: nvidia-persistenced failed to initialize. Check syslog for more details.
Nov 20 23:13:01 gputest systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
Nov 20 23:13:01 gputest systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Nov 20 23:13:01 gputest systemd[1]: Failed to start NVIDIA Persistence Daemon.

How the driver was installed (from the official NVIDIA site):
Code:
wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda

I also tried installing from their .run installer; it failed with: Unable to load the kernel module 'nvidia.ko'.

I tried a different .run installer as well, but it failed with the same error.

I tried creating a VM from the Pop!_OS ISO with preinstalled NVIDIA drivers: no luck.

I tried an older setup, Ubuntu Server zesty with CUDA 9.2: same thing.

Is there anything I can do?
Maybe I forgot something in the host configuration, maybe I didn't enable something in the VM, or maybe my card isn't supported on Linux at all?

I don't want to install the NVIDIA drivers, CUDA, etc. on the host machine because NVIDIA's packages are a mess. It's safer and easier (yeah, not really) to use a VM for this, right?
 
Thanks for the reply!

I tried it and got this error in Proxmox:
Code:
kvm: -global q35-pci: warning: short-form boolean option 'q35-pci' deprecated
Please use q35-pci=on instead
kvm: -global q35-pci: Invalid parameter 'q35-pci'
TASK ERROR: start failed: QEMU exited with code 1

I changed q35-pci to q35-pci=on in /etc/pve/qemu-server/<VMID>.conf.
A new error occurred:
Code:
kvm: -global q35-pci=on: Invalid parameter 'q35-pci'
TASK ERROR: start failed: QEMU exited with code 1

So I removed q35-pci=on. The VM started and booted, but the NVIDIA driver still would not load: same error, BAR1 is 0M @ 0x0 :-(
 
I've created a new VM and installed a fresh Ubuntu Server 22.04.1.

  • RAM: 8 GB
  • CPU: Intel, 8 cores
  • BIOS: OVMF (UEFI)
  • Display: SPICE (qxl)
  • Machine: q35
  • PCI Device: Tesla V100, [x] All Functions; [ ] PCI-Express
  • nouveau driver blacklisted via modprobe
  • Secure boot: disabled

It looks like the GPU's memory regions (BARs) are not mapped properly during passthrough: the NVIDIA driver reports a 0 MB BAR1 (in the old VM), and lspci (even without any driver installed) shows this:
Code:
06:10.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
        Physical Slot: 16-2
        Flags: fast devsel, IRQ 11
        Memory at ff000000 (32-bit, non-prefetchable) [disabled] [size=16M]
        Memory at <ignored> (64-bit, prefetchable) [disabled] // <-- Why does it say <ignored>?
        Memory at <ignored> (64-bit, prefetchable) [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Kernel modules: nvidiafb, nouveau
 
Ah, the quoted part is slightly wrong; it should be:

Code:
qm set VMID -args '-global q35-pcihost.pci-hole64-size=2048G'

Note that there must be no space between q35-pci and host.
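
To double-check that the option actually landed in the VM config, something like the following could be used. It is only a sketch: the helper name has_pci_hole64 is made up for illustration, and it assumes that qm set VMID -args ... is stored as an args: line in /etc/pve/qemu-server/<VMID>.conf.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): succeeds if the given VM config
# file has an args: line that sets q35-pcihost.pci-hole64-size.
has_pci_hole64() {
    grep -Eq '^args:.*q35-pcihost\.pci-hole64-size=[0-9]+[A-Za-z]?' "$1"
}

# Usage (replace <VMID> with the real ID):
#   has_pci_hole64 /etc/pve/qemu-server/<VMID>.conf && echo "pci-hole64-size is set"
```

A line with a stray space in q35-pcihost would not match, which is the typo discussed above.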
 
I added pci=realloc to the kernel args; lspci still shows <ignored>.
Here is what the kernel says about this GPU:
Code:
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: [10de:1db6] type 00 class 0x030200
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: reg 0x10: [mem 0xff000000-0xffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: reg 0x14: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: reg 0x1c: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: can't claim BAR 0 [mem 0xff000000-0xffffffff]: no compatible bridge window
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: can't claim BAR 1 [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: can't claim BAR 3 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: trying firmware assignment [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: trying firmware assignment [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 0: no space for [mem size 0x01000000]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 0: trying firmware assignment [mem 0xff000000-0xffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 0: assigned [mem 0xff000000-0xffffffff]
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: trying firmware assignment [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: trying firmware assignment [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: trying firmware assignment [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: trying firmware assignment [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]

I've also attached a more complete log.
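
For anyone wading through a similar log, a small filter can boil it down to the BARs the guest kernel gave up on. This is a rough sketch, assuming the message format shown above; failed_bars is a made-up name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and print one line per
# BAR the guest kernel failed to assign, as "<pci address> BAR <n> <size>".
# Duplicate messages (the kernel retries several times) are collapsed.
failed_bars() {
    sed -n 's/.*pci \([0-9a-f:.]*\): BAR \([0-9]*\): failed to assign \[mem size \(0x[0-9a-f]*\).*/\1 BAR \2 \3/p' | sort -u
}

# Usage inside the guest:
#   dmesg | failed_bars
```

On the log above this should leave just two lines, one for BAR 1 and one for BAR 3.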
 

I ran qm set VMID -args '-global q35-pcihost.pci-hole64-size=2048G' and... now the VM fails at boot:
Code:
Begin: Waiting for root file system...
Begin: Running /scripts/local-block...
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: error opening /dev/md?*: No such file or directory
done.
Gave up waiting for root file system device. Common problems:
  - Boot args (cat /proc/cmdline)
    - Check rootdelay= (did the system wait long enough?)
  - Missing modules (cat /proc/modules; ls /dev)
ALERT! UUID=... does not exist. Dropping to a shell!
 
Looks like this is a problem on my end. The VM disk is currently on a RAID-backed disk; I'll create a new partition and try again.
 
OK, it looks like that boot failure is just new Ubuntu Server shenanigans. I set the same parameters on the old Debian 11 VM and got a different error!

Code:
Nov 21 13:06:50 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 21 13:06:50 gputest kernel:
Nov 21 13:06:50 gputest kernel: nvidia 0000:01:00.0: enabling device (0140 -> 0142)
Nov 21 13:06:50 gputest kernel: NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:1db6)
                                NVRM: installed in this system is not supported by the
                                NVRM: NVIDIA 520.61.05 driver release.
                                NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                                NVRM: in this release's README, available on the operating system
                                NVRM: specific graphics driver download page at www.nvidia.com.
Nov 21 13:06:50 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 21 13:06:50 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 21 13:06:50 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 21 13:06:50 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
Nov 21 13:06:51 gputest systemd-modules-load[348]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 21 13:06:51 gputest systemd-modules-load[346]: modprobe: ERROR: ../libkmod/libkmod-module.c:990 command_do() Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
Nov 21 13:06:51 gputest systemd-modules-load[346]: modprobe: ERROR: could not insert 'nvidia': Invalid argument
Nov 21 13:06:51 gputest kernel: snd_hda_intel 0000:00:1b.0: no codecs found!
Nov 21 13:06:51 gputest systemd[1]: Reached target Sound Card.

So I guess I should try an older NVIDIA driver version? Any suggestions on distro, distro version, kernel version, and NVIDIA driver version for the Tesla V100 PCIe?
 
Btw, what does q35-pcihost.pci-hole64-size=2048G actually do? I assume it limits GPU memory to 2 GB instead of the full 32 GB (I've seen some threads about problems when there is more than 4 GB of VRAM; is that it)?
 
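It sets the size of the 64-bit PCI hole, see https://en.wikipedia.org/wiki/PCI_hole
2048G is actually 2 terabytes, not 2 gigabytes. It does not limit GPU memory; it is the guest address window above 4 GB into which 64-bit BARs, such as the card's 32G BAR shown in your lspci output, get mapped.

Sometimes that's necessary for the driver/card to initialize properly.

Sadly, I don't have such a card here to test, so I cannot say for sure what's wrong.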
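
The sizes involved can be sanity-checked with a bit of shell arithmetic. This is just an illustration: to_gib is a made-up helper, and it assumes QEMU treats the G suffix as a power-of-two unit (GiB).

```shell
#!/bin/sh
# Convert a byte count (decimal or 0x-prefixed hex) to whole GiB.
to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

to_gib 0x800000000                         # BAR 1 size from the kernel log
to_gib 0x02000000                          # BAR 3 size, well under 1 GiB
to_gib $(( 2048 * 1024 * 1024 * 1024 ))    # a 2048G PCI hole, in GiB
```

The first call gives 32, matching the 32G region in lspci, so a 2048G hole leaves plenty of room for it.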
 
Oh, OK. A different error message is a huge improvement, thanks (really). I'll try some other driver versions; maybe that will work. I'll report back if it does.

(And btw, <ignored> in lspci is now replaced with some actual numbers!! So it's definitely a step in the right direction.)
 
Soo... It turns out that the problem was the VM's BIOS (somehow??). I stumbled upon this in an NVIDIA forum thread. So I created a VM with:
  • BIOS: Default (SeaBIOS)
  • Machine: Default (i440fx)
  • Linux: Pop!_OS 22.04 (but I guess any distro will work)
  • PCI: Tesla V100 (as is)
  • no additional configuration on the VM, just as is
And it just worked! nvidia-smi showed the latest (515) LTS NVIDIA driver and everything!

No idea why a legacy BIOS would solve this, but I'm glad it did.
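
For reference, the same settings could probably be applied from the host shell instead of the GUI. The sketch below only prints the qm commands so they can be reviewed first; the helper name and the VM ID 100 are made up, and 5e:00.0 is the GPU's host address from my lspci output. I created a fresh VM rather than converting the old one, since a guest installed under OVMF usually won't boot under SeaBIOS.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) qm commands that give a VM the
# firmware/machine combination reported to work in this thread.
# $1 = VM ID, $2 = host PCI address of the GPU.
print_seabios_passthrough_cmds() {
    vmid=$1
    gpu=$2
    echo "qm set $vmid --bios seabios"      # SeaBIOS instead of OVMF (UEFI)
    echo "qm set $vmid --delete machine"    # drop q35, fall back to default i440fx
    echo "qm set $vmid --hostpci0 $gpu"     # plain passthrough, no pcie=1
}

# Example with an assumed VM ID of 100:
print_seabios_passthrough_cmds 100 5e:00.0
```

Leaving out pcie=1 is deliberate here, since that flag belongs to the q35 setup I moved away from.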
 