Failed to initialize NVML: Unknown Error - GPU passthrough

AxelTwin

Well-Known Member
Oct 10, 2017
Hi everybody,

I configured a VM with GPU passthrough on Proxmox 7 with a Tesla T4 card.
Since then, nvidia-smi has stopped working on the host and I can't figure out what went wrong.
Here is the output I get:

Code:
root@hyperviser:~# nvidia-smi
Failed to initialize NVML: Unknown Error

And here is some output from dmesg:

Code:
root@hyperviser:~# dmesg | grep 5e:00
[    1.040993] pci 0000:5e:00.0: [10de:1eb8] type 00 class 0x030200
[    1.041004] pci 0000:5e:00.0: reg 0x10: [mem 0xb9000000-0xb9ffffff]
[    1.041014] pci 0000:5e:00.0: reg 0x14: [mem 0xbfe0000000-0xbfefffffff 64bit pref]
[    1.041023] pci 0000:5e:00.0: reg 0x1c: [mem 0xbff0000000-0xbff1ffffff 64bit pref]
[    1.041265] pci 0000:5e:00.0: Enabling HDA controller
[    1.041307] pci 0000:5e:00.0: PME# supported from D0 D3hot D3cold
[    1.041334] pci 0000:5e:00.0: reg 0xbf0: [mem 0x00000000-0x0003ffff]
[    1.041335] pci 0000:5e:00.0: VF(n) BAR0 space: [mem 0x00000000-0x003fffff] (contains BAR0 for 16 VFs)
[    1.041344] pci 0000:5e:00.0: reg 0xbf4: [mem 0x00000000-0x0fffffff 64bit pref]
[    1.041346] pci 0000:5e:00.0: VF(n) BAR1 space: [mem 0x00000000-0xffffffff 64bit pref] (contains BAR1 for 16 VFs)
[    1.041354] pci 0000:5e:00.0: reg 0xbfc: [mem 0x00000000-0x01ffffff 64bit pref]
[    1.041356] pci 0000:5e:00.0: VF(n) BAR3 space: [mem 0x00000000-0x1fffffff 64bit pref] (contains BAR3 for 16 VFs)
[    1.041425] pci 0000:5e:00.0: 63.008 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x8 link at 0000:5d:00.0 (capable of 126.016 Gb/s with 8.0 GT/s PCIe x16 link)
[    1.080219] pnp 00:01: disabling [mem 0xfed1c000-0xfed3ffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080223] pnp 00:01: disabling [mem 0xfed45000-0xfed8bfff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080225] pnp 00:01: disabling [mem 0xff000000-0xffffffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080227] pnp 00:01: disabling [mem 0xfee00000-0xfeefffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080228] pnp 00:01: disabling [mem 0xfed12000-0xfed1200f] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080230] pnp 00:01: disabling [mem 0xfed12010-0xfed1201f] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080231] pnp 00:01: disabling [mem 0xfed1b000-0xfed1bfff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080854] pnp 00:04: disabling [mem 0xfd000000-0xfdabffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080856] pnp 00:04: disabling [mem 0xfdad0000-0xfdadffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080858] pnp 00:04: disabling [mem 0xfdb00000-0xfdffffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080860] pnp 00:04: disabling [mem 0xfe000000-0xfe00ffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080861] pnp 00:04: disabling [mem 0xfe011000-0xfe01ffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080862] pnp 00:04: disabling [mem 0xfe036000-0xfe03bfff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080864] pnp 00:04: disabling [mem 0xfe03d000-0xfe3fffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.080865] pnp 00:04: disabling [mem 0xfe410000-0xfe7fffff] because it overlaps 0000:5e:00.0 BAR 8 [mem 0x00000000-0xffffffff 64bit pref]
[    1.094944] pci 0000:5e:00.0: BAR 8: no space for [mem size 0x100000000 64bit pref]
[    1.094946] pci 0000:5e:00.0: BAR 8: failed to assign [mem size 0x100000000 64bit pref]
[    1.094947] pci 0000:5e:00.0: BAR 10: no space for [mem size 0x20000000 64bit pref]
[    1.094949] pci 0000:5e:00.0: BAR 10: failed to assign [mem size 0x20000000 64bit pref]
[    1.094950] pci 0000:5e:00.0: BAR 7: no space for [mem size 0x00400000]
[    1.094951] pci 0000:5e:00.0: BAR 7: failed to assign [mem size 0x00400000]
[    1.094984] pci 0000:5e:00.0: BAR 1: assigned [mem 0xb000000000-0xb00fffffff 64bit pref]
[    1.094991] pci 0000:5e:00.0: BAR 8: assigned [mem 0xb010000000-0xb10fffffff 64bit pref]
[    1.094994] pci 0000:5e:00.0: BAR 3: assigned [mem 0xb110000000-0xb111ffffff 64bit pref]
[    1.095000] pci 0000:5e:00.0: BAR 10: assigned [mem 0xb112000000-0xb131ffffff 64bit pref]
[    1.095003] pci 0000:5e:00.0: BAR 0: assigned [mem 0xbb000000-0xbbffffff]
[    1.095006] pci 0000:5e:00.0: BAR 7: assigned [mem 0xbc000000-0xbc3fffff]
[    1.101725] pci 0000:5e:00.0: Adding to iommu group 79
[    9.247094] NVRM: GPU at 0000:5e:00.0 has software scheduler DISABLED with policy BEST_EFFORT.
[    9.848237] nvidia 0000:5e:00.0: Driver cannot be asked to release device
[    9.848311] nvidia 0000:5e:00.0: MDEV: Registered
[   14.185300] nvidia 0000:5e:00.0: MDEV: Unregistering
[   27.192546] vfio-pci 0000:5e:00.0: vfio_cap_init: hiding cap 0x0@0x68
[   27.192591] vfio-pci 0000:5e:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[   27.192616] vfio-pci 0000:5e:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
root@hyperviser:~# dmesg | grep vfio
[    5.212096] vfio_pci: invalid id string "10de:leb8"
[   10.147214] vfio_mdev d77668c2-36bf-4c12-a6f7-b20b7c214384: Adding to iommu group 143
[   10.147222] vfio_mdev d77668c2-36bf-4c12-a6f7-b20b7c214384: MDEV: group_id = 143
[   14.185487] vfio_mdev d77668c2-36bf-4c12-a6f7-b20b7c214384: Removing from iommu group 143
[   14.185498] vfio_mdev d77668c2-36bf-4c12-a6f7-b20b7c214384: MDEV: detaching iommu
[   27.192546] vfio-pci 0000:5e:00.0: vfio_cap_init: hiding cap 0x0@0x68
[   27.192591] vfio-pci 0000:5e:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[   27.192616] vfio-pci 0000:5e:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
 

Hello!

I know your post is from last year, but did you solve your problem? If so, how did you solve it?
I'm having the same trouble with my Proxmox setup and have no clue what causes it.
Here is my detailed post: https://forum.proxmox.com/threads/n...iled-to-initialize-nvml-unknown-error.135355/

Thanks!
 
I recommend reviewing your BIOS settings, such as Advanced Power Management Configuration. Ensure that the Power/Performance Profile selection is set to Virtualization. This adjustment solved the issues I experienced.
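
Separately, it may be worth looking at one line in the dmesg output from the first post: vfio_pci reports an invalid id string "10de:leb8", while the card itself is listed as [10de:1eb8]. The rejected string has a lowercase letter "l" where the device ID has the digit "1", so the vfio-pci ids option may contain a typo (the thread does not show where that option is set). Whether this is related to the NVML error is not confirmed here. A minimal Python sketch for catching this kind of typo, assuming each entry is a plain hexadecimal vendor:device pair; the function name is only illustrative:

```python
import re

# Assumption: each vfio-pci "ids" entry is a hexadecimal vendor:device pair.
# Optional extra fields (subvendor, class, ...) are not handled in this sketch.
PCI_ID = re.compile(r"[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}")


def invalid_pci_ids(ids):
    """Return the entries of a comma-separated ids string that are not hex vendor:device pairs."""
    return [entry for entry in ids.split(",") if not PCI_ID.fullmatch(entry.strip())]


# The string the kernel rejected in the dmesg output (letter "l" in "leb8").
print(invalid_pci_ids("10de:leb8"))
# The ID the kernel reported for 0000:5e:00.0 (digit "1" in "1eb8").
print(invalid_pci_ids("10de:1eb8"))
```

The first call flags the entry because "l" is not a hexadecimal digit; the second returns an empty list.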
 
