[SOLVED] NVIDIA drivers for Tesla V100 PCIe 32GB failing to load

demiler · Nov 20, 2022

Hi, I'm currently battling with NVIDIA GPU drivers. The issue: even though I was able to pass the GPU through to the VM successfully, no NVIDIA driver will load there. I'm at a dead end right now.

Host:
PVE version: 7.2-4
Linux: Debian 11 (5.15.35-2-pve)
GPU: NVIDIA Tesla V100 PCIe 32GB
CPU: Intel
VT-d: enabled
Above 4G Decoding: enabled
Kernel args: quiet intel_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off

modprobe.d configs:

blacklist.conf:
Code:
```
blacklist nouveau
blacklist nvidia
```

iommu_unsafe_interrupts.conf:

Code:

options vfio_iommu_type1 allow_unsafe_interrupts=1

kvm.conf:
Code:
```
options kvm ignore_msrs=1
```
mdadm.conf:
Code:
```
options md_mod start_ro=1
```
pve-blacklist.conf:
Code:
```
blacklist nvidiafb
```

/etc/modules:

Code:

usbhid
psmouse
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

lspci -v (with nvidia part only):

Code:

5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
        Flags: fast devsel, IRQ 690, NUMA node 0, IOMMU group 93
        Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 3bf000000000 (64-bit, prefetchable) [size=32G]
        Memory at 3bf800000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] Secondary PCI Express
        Capabilities: [ac0] Designated Vendor-Specific: Vendor=10de ID=0001 Rev=1 Len=12 <?>
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
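A quick way to sanity-check the host side of the passthrough is to confirm that the card is bound to vfio-pci and that its BARs have the expected sizes. The sketch below parses an abridged copy of the lspci output above; the helper names `driver_in_use` and `bar_sizes` are made up for illustration and are not part of any tool.

```python
import re

# Abridged copy of the host `lspci -v` output quoted above.
HOST_LSPCI = """\
5e:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Flags: fast devsel, IRQ 690, NUMA node 0, IOMMU group 93
        Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 3bf000000000 (64-bit, prefetchable) [size=32G]
        Memory at 3bf800000000 (64-bit, prefetchable) [size=32M]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
"""


def driver_in_use(lspci_text):
    """Return the bound kernel driver name, or None if no driver is bound."""
    match = re.search(r"^\s*Kernel driver in use:\s*(\S+)", lspci_text, re.MULTILINE)
    return match.group(1) if match else None


def bar_sizes(lspci_text):
    """Return the size suffix of every memory region, in listing order."""
    return re.findall(r"\[size=(\w+)\]", lspci_text)


print(driver_in_use(HOST_LSPCI))  # vfio-pci
print(bar_sizes(HOST_LSPCI))      # ['16M', '32G', '32M']
```

On the host everything looks as expected here: vfio-pci owns the device and the large 32G region is present.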

VM:
Linux: Debian 11 (5.10.0-19-amd64)
nvidia drivers: 520.61.05-1 amd64

apt sources:

Code:

http://deb.debian.org/debian bullseye
http://security.debian.org/debian-security bullseye-security
http://deb.debian.org/debian bullseye-updates
https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64

nvcc --version: cuda_11.8.r11.8/compiler.31833905_0
gcc --version: 10.2.1 20210110
RAM: 16 GB
BIOS: OVMF (UEFI)
Machine: q35
PCI device (gpu): pcie=1
kernel args: quiet pci=realloc
secure boot: disabled

lspci -v (with nvidia part only):

Code:

01:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
        Physical Slot: 0
        Flags: fast devsel, IRQ 16
        Memory at ff000000 (32-bit, non-prefetchable) [size=16M]
        Memory at <ignored> (64-bit, prefetchable)
        Memory at <ignored> (64-bit, prefetchable)
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Kernel modules: nvidia

Issue: the VM with the GPU fails to load the NVIDIA driver during boot:

Code:

Nov 20 23:13:00 gputest kernel: nvidia: loading out-of-tree module taints kernel.
Nov 20 23:13:00 gputest kernel: nvidia: module license 'NVIDIA' taints kernel.
Nov 20 23:13:00 gputest kernel: Disabling lock debugging due to kernel taint
...
Nov 20 23:13:00 gputest kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 20 23:13:00 gputest kernel: nvidia 0000:01:00.0: enabling device (0140 -> 0142)
Nov 20 23:13:00 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                                NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:00 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:00 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:00 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
...
Nov 20 23:13:00 gputest systemd-modules-load[351]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 20 23:13:00 gputest systemd-modules-load[349]: modprobe: ERROR: ../libkmod/libkmod-module.c:990 command_do() Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
Nov 20 23:13:00 gputest systemd-modules-load[349]: modprobe: ERROR: could not insert 'nvidia': Invalid argument
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 20 23:13:00 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                                NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:00 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:00 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:00 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:00 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
Nov 20 23:13:00 gputest systemd-udevd[415]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 20 23:13:00 gputest systemd-udevd[405]: Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
... (repeats a couple of times)
Nov 20 23:13:01 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Nov 20 23:13:01 gputest kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
                                NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:00.0)
Nov 20 23:13:01 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 20 23:13:01 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 20 23:13:01 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 20 23:13:01 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 243
Nov 20 23:13:01 gputest nvidia-persistenced[610]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 106 has read and write permissions for those files.
Nov 20 23:13:01 gputest nvidia-persistenced[610]: Shutdown (610)
Nov 20 23:13:01 gputest nvidia-persistenced[590]: nvidia-persistenced failed to initialize. Check syslog for more details.
Nov 20 23:13:01 gputest systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
Nov 20 23:13:01 gputest systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Nov 20 23:13:01 gputest systemd[1]: Failed to start NVIDIA Persistence Daemon.

How the driver was installed (following the official NVIDIA site):

Code:

wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda

I've tried installing from their .run installer; it failed with: Unable to load the kernel module 'nvidia.ko'.

I've tried installing the drivers from another .run installer, but it failed with the same error.

I've tried creating a VM from the Pop!_OS ISO with preinstalled NVIDIA drivers: no luck.

I've tried installing an older version on Ubuntu Server (zesty) with CUDA 9.2: same thing.

Is there anything I can do?
Maybe I forgot something in the host configuration, maybe I didn't enable something in the VM, or maybe my card isn't supported on Linux at all?

I don't want to install the NVIDIA drivers, CUDA, etc. on the host machine because NVIDIA's packages are a mess. It's safer and easier (yeah, not really) to use a VM for this, right?

dcsapak · Nov 21, 2022

Could you try these steps from another user? https://forum.proxmox.com/threads/gpu-passthrough-code-12.110407/#post-504040

trio198 said:
This is what works for me:

In the GRUB config, add the "pci=realloc" option to GRUB_CMDLINE_LINUX,
and add this args parameter to the VM:

qm set VMID -args '-global q35-pci host.pci-hole64-size=2048G'

demiler · Nov 21, 2022

Thanks for the reply!

I tried it and got this error in Proxmox:

Code:

kvm: -global q35-pci: warning: short-form boolean option 'q35-pci' deprecated
Please use q35-pci=on instead
kvm: -global q35-pci: Invalid parameter 'q35-pci'
TASK ERROR: start failed: QEMU exited with code 1

I changed q35-pci to q35-pci=on in /etc/pve/qemu-server/<VMID>.conf.
A new error occurred:

Code:

kvm: -global q35-pci=on: Invalid parameter 'q35-pci'
TASK ERROR: start failed: QEMU exited with code 1

So I removed q35-pci=on. The VM started and booted, but the NVIDIA driver still would not load: same error, BAR1 is 0M @ 0x0 :-(

demiler · Nov 21, 2022

I've created a new VM and installed a fresh Ubuntu Server 22.04.1.

RAM: 8 GB
CPU: Intel, 8 cores
BIOS: OVMF (UEFI)
Display: SPICE (qxl)
Machine: q35
PCI Device: Tesla V100, [x] All Functions; [ ] PCI-Express
Blacklisted nouveau in modprobe
Secure boot: disabled

It looks like the GPU's memory regions are not mapped properly during passthrough: the NVIDIA driver reports a 0 MB BAR1 (in the old VM), and lspci (even without any drivers installed) shows this:

Code:

06:10.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
        Physical Slot: 16-2
        Flags: fast devsel, IRQ 11
        Memory at ff000000 (32-bit, non-prefetchable) [disabled] [size=16M]
        Memory at <ignored> (64-bit, prefetchable) [disabled] // <-- Why does it say <ignored>?
        Memory at <ignored> (64-bit, prefetchable) [disabled]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Kernel modules: nvidiafb, nouveau
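To make the symptom easy to spot across reboots, the guest lspci output can be checked for regions that were never given an address. This is only a sketch over the output above; `memory_regions` is a hypothetical helper name, and it assumes that lspci prints `<ignored>` in place of the address for such regions, as it does in this listing.

```python
import re

# Abridged copy of the guest `lspci -v` output quoted above.
GUEST_LSPCI = """\
06:10.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
        Memory at ff000000 (32-bit, non-prefetchable) [disabled] [size=16M]
        Memory at <ignored> (64-bit, prefetchable) [disabled]
        Memory at <ignored> (64-bit, prefetchable) [disabled]
"""


def memory_regions(lspci_text):
    """Return (address, assigned) pairs for each 'Memory at' line."""
    regions = []
    for match in re.finditer(r"Memory at (\S+)", lspci_text):
        address = match.group(1)
        regions.append((address, address != "<ignored>"))
    return regions


unassigned = [addr for addr, ok in memory_regions(GUEST_LSPCI) if not ok]
print(len(unassigned))  # 2
```

Both 64-bit prefetchable regions are unassigned here, which matches the driver's complaint that BAR1 is 0M @ 0x0.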

dcsapak · Nov 21, 2022

Ah, the quoted part is slightly wrong; it should be:

Code:

qm set VMID -args '-global q35-pcihost.pci-hole64-size=2048G'

Note that there is no space between q35-pci and host: the device name is q35-pcihost.
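Why the stray space breaks things can be illustrated with a small check. The sketch below assumes the args string is split on whitespace before it reaches QEMU (which is consistent with the error message `kvm: -global q35-pci: ...` earlier in the thread) and only validates the short `driver.property=value` form of `-global`. The function name `check_global_args` is made up for this example.

```python
import re
import shlex

# Short form accepted by -global: driver.property=value
# (QEMU also has a long key=value form, which this sketch does not cover.)
GLOBAL_FORM = re.compile(r"^[^.\s=]+\.[^\s=]+=\S+$")


def check_global_args(args):
    """Return the values following -global that are not driver.property=value."""
    tokens = shlex.split(args)
    problems = []
    for index, token in enumerate(tokens):
        if token == "-global":
            value = tokens[index + 1] if index + 1 < len(tokens) else ""
            if not GLOBAL_FORM.match(value):
                problems.append(value)
    return problems


print(check_global_args("-global q35-pci host.pci-hole64-size=2048G"))  # ['q35-pci']
print(check_global_args("-global q35-pcihost.pci-hole64-size=2048G"))   # []
```

With the space, `-global` only receives `q35-pci`, which is not a complete setting; without it, the whole `q35-pcihost.pci-hole64-size=2048G` string is passed as one value.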

demiler · Nov 21, 2022

I added pci=realloc to the kernel args; still the same thing with <ignored>.
Here is what the kernel says about this GPU:

Code:

Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: [10de:1db6] type 00 class 0x030200
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: reg 0x10: [mem 0xff000000-0xffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: reg 0x14: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: reg 0x1c: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: can't claim BAR 0 [mem 0xff000000-0xffffffff]: no compatible bridge window
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: can't claim BAR 1 [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: can't claim BAR 3 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: trying firmware assignment [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: trying firmware assignment [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 0: no space for [mem size 0x01000000]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 0: trying firmware assignment [mem 0xff000000-0xffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 0: assigned [mem 0xff000000-0xffffffff]
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: trying firmware assignment [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: trying firmware assignment [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
...
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: trying firmware assignment [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: [mem 0xfffffff800000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 1: failed to assign [mem size 0x800000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: trying firmware assignment [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref] conflicts with PCI mem [mem 0x00000000-0xffffffffff]
Nov 21 09:33:21 gputest-ubsr kernel: pci 0000:06:10.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]

I've also attached a more complete log.
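The sizes in those "no space" lines are hexadecimal byte counts, which are easier to read once converted. A minimal sketch, using three lines from the log above; `failed_bars` and `human` are illustrative helper names only.

```python
import re

# Three representative lines from the guest kernel log above.
DMESG = """\
pci 0000:06:10.0: BAR 1: no space for [mem size 0x800000000 64bit pref]
pci 0000:06:10.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
pci 0000:06:10.0: BAR 0: no space for [mem size 0x01000000]
"""

NO_SPACE = re.compile(r"BAR (\d+): no space for \[mem size (0x[0-9a-fA-F]+)")


def failed_bars(log_text):
    """Map BAR index -> requested size in bytes for every 'no space' line."""
    return {int(bar): int(size, 16) for bar, size in NO_SPACE.findall(log_text)}


def human(num_bytes):
    """Format a byte count with binary units (KiB, MiB, GiB, TiB)."""
    value = num_bytes
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:g} {unit}"
        value /= 1024


for bar, size in sorted(failed_bars(DMESG).items()):
    print(f"BAR {bar}: {human(size)}")
# BAR 0: 16 MiB
# BAR 1: 32 GiB
# BAR 3: 32 MiB
```

So the region the guest cannot place is the 32 GiB BAR 1, the same size as the 32G region the host lspci reports.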

demiler · Nov 21, 2022

I've run qm set VMID -args '-global q35-pcihost.pci-hole64-size=2048G' and... now the VM fails at boot:

Code:

Begin: Waiting for root file system...
Begin: Running /scripts/local-block...
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: No arrays found in config file or automatically
mdadm: error opening /dev/md?*: No such file or directory
done.
Gave up waiting for root file system device. Common problems:
  - Boot args (cat /proc/cmdline)
    - Check rootdelay= (did the system wait long enough?)
  - Missing modules (cat /proc/modules; ls /dev)
ALERT! UUID=... does not exist. Dropping to a shell!

demiler · Nov 21, 2022

Looks like this is a problem on my end. The VM disk is currently on a RAID-backed disk; I'll create a new partition and try again.

demiler · Nov 21, 2022

Moved the VM disks to another partition: same thing, weird.

demiler · Nov 21, 2022

OK, it looks like this weird boot failure is a quirk of the new Ubuntu Server VM. I set the same parameters on the old Debian 11 VM and got a different error!

Code:

Nov 21 13:06:50 gputest kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
Nov 21 13:06:50 gputest kernel:
Nov 21 13:06:50 gputest kernel: nvidia 0000:01:00.0: enabling device (0140 -> 0142)
Nov 21 13:06:50 gputest kernel: NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:1db6)
                                NVRM: installed in this system is not supported by the
                                NVRM: NVIDIA 520.61.05 driver release.
                                NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
                                NVRM: in this release's README, available on the operating system
                                NVRM: specific graphics driver download page at www.nvidia.com.
Nov 21 13:06:50 gputest kernel: nvidia: probe of 0000:01:00.0 failed with error -1
Nov 21 13:06:50 gputest kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Nov 21 13:06:50 gputest kernel: NVRM: None of the NVIDIA devices were initialized.
Nov 21 13:06:50 gputest kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 245
Nov 21 13:06:51 gputest systemd-modules-load[348]: modprobe: ERROR: could not insert 'nvidia_current': No such device
Nov 21 13:06:51 gputest systemd-modules-load[346]: modprobe: ERROR: ../libkmod/libkmod-module.c:990 command_do() Error running install command 'modprobe -i nvidia-current ' for module nvidia: retcode 1
Nov 21 13:06:51 gputest systemd-modules-load[346]: modprobe: ERROR: could not insert 'nvidia': Invalid argument
Nov 21 13:06:51 gputest kernel: snd_hda_intel 0000:00:1b.0: no codecs found!
Nov 21 13:06:51 gputest systemd[1]: Reached target Sound Card.

So I guess I should try an older NVIDIA driver version? Any suggestions on distro, distro version, kernel version, and NVIDIA driver version for the Tesla V100 PCIe?

demiler · Nov 21, 2022

Btw, what does q35-pcihost.pci-hole64-size=2048G actually do? I assume it limits GPU memory to 2 GB instead of the full 32 GB (I've seen some topics about problems when there is more than 4 GB of VRAM, is that it)?

dcsapak · Nov 21, 2022

It sets the size of the 64-bit PCI hole, see https://en.wikipedia.org/wiki/PCI_hole
2048G is actually 2 terabytes, not 2 gigabytes.

Sometimes that's necessary for the driver/card to initialize properly.

Sadly, I don't have such a card here to test, so I cannot say for sure what's wrong.
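For a sense of scale, the hole size can be compared with the card's largest BAR. This is a rough sketch that assumes the `G` suffix is binary (powers of 1024); `parse_size` is a hypothetical helper, and the BAR 1 size is taken from the kernel log earlier in the thread.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def parse_size(text):
    """Parse sizes like '2048G', assuming binary suffixes (K/M/G/T)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": GIB, "T": TIB}
    return int(text[:-1]) * units[text[-1].upper()]


hole = parse_size("2048G")  # value passed as pci-hole64-size
bar1 = 0x800000000          # BAR 1 size from the guest kernel log

print(hole / TIB)    # 2.0
print(bar1 / GIB)    # 32.0
print(hole >= bar1)  # True
```

Under that assumption the 2 TiB window is 64 times larger than the 32 GiB BAR, so it does not cap GPU memory; it only gives the guest enough 64-bit address space to place the BAR.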

demiler · Nov 21, 2022

Oh, OK. A different error message is a huge improvement, thanks (really). I'll try some other driver versions; maybe that'll work. I'll report back if it does.

(And btw, <ignored> in lspci is now replaced with actual addresses!! So it's definitely a step in the right direction.)

demiler · Nov 22, 2022

Soo... it turns out that the problem was the VM's BIOS (somehow??). I stumbled upon this in an NVIDIA forum thread. So I created a VM with:

BIOS: Default (SeaBIOS)
Machine: Default (i440fx)
Linux: Pop!_OS 22.04 (but I guess any distro will work)
PCI: Tesla V100 (as is)
No additional VM configuration, just the defaults

And it just worked! nvidia-smi showed the latest (515) LTS NVIDIA driver and everything!

No idea why a legacy BIOS would solve this, but I'm glad it did.
