Proxmox and Nvidia drivers help needed

christopher_init

Hello all,

Environment: pve v7.4.16
System: HP Z2 G4, Intel Xeon E-2278G, Intel iGPU
Dedicated GPU: GTX 1650

I am passing through the Intel iGPU 630 to a Windows VM. That works as expected.
I am now trying to use my dedicated GPU on the host for LXC containers. It is the only card in my PCIe slots.
I went into the BIOS, checked "Integrated Video", and set Primary Video to my GTX 1650.
I made sure each device is in its own IOMMU group. I blacklisted nouveau and installed the latest Nvidia drivers for my card.

IOMMU groups (GPU is group 1, Intel iGPU is group 2):
Code:
Jul 19 13:25:37 pve kernel: [    0.567365] pci 0000:00:00.0: Adding to iommu group 0
Jul 19 13:25:37 pve kernel: [    0.567380] pci 0000:00:01.0: Adding to iommu group 1
Jul 19 13:25:37 pve kernel: [    0.567389] pci 0000:00:02.0: Adding to iommu group 2
Jul 19 13:25:37 pve kernel: [    0.567400] pci 0000:00:12.0: Adding to iommu group 3
Jul 19 13:25:37 pve kernel: [    0.567417] pci 0000:00:14.0: Adding to iommu group 4
Jul 19 13:25:37 pve kernel: [    0.567425] pci 0000:00:14.2: Adding to iommu group 4
Jul 19 13:25:37 pve kernel: [    0.567436] pci 0000:00:16.0: Adding to iommu group 5
Jul 19 13:25:37 pve kernel: [    0.567444] pci 0000:00:17.0: Adding to iommu group 6
Jul 19 13:25:37 pve kernel: [    0.567461] pci 0000:00:1b.0: Adding to iommu group 7
Jul 19 13:25:37 pve kernel: [    0.567478] pci 0000:00:1b.4: Adding to iommu group 8
Jul 19 13:25:37 pve kernel: [    0.567489] pci 0000:00:1d.0: Adding to iommu group 9
Jul 19 13:25:37 pve kernel: [    0.567512] pci 0000:00:1f.0: Adding to iommu group 10
Jul 19 13:25:37 pve kernel: [    0.567520] pci 0000:00:1f.3: Adding to iommu group 10
Jul 19 13:25:37 pve kernel: [    0.567529] pci 0000:00:1f.4: Adding to iommu group 10
Jul 19 13:25:37 pve kernel: [    0.567538] pci 0000:00:1f.5: Adding to iommu group 10
Jul 19 13:25:37 pve kernel: [    0.567546] pci 0000:00:1f.6: Adding to iommu group 10
Jul 19 13:25:37 pve kernel: [    0.567551] pci 0000:01:00.0: Adding to iommu group 1
Jul 19 13:25:37 pve kernel: [    0.567556] pci 0000:01:00.1: Adding to iommu group 1
Jul 19 13:25:37 pve kernel: [    0.567571] pci 0000:02:00.0: Adding to iommu group 11
Jul 19 13:25:37 pve kernel: [    0.567583] pci 0000:03:00.0: Adding to iommu group 12
Jul 19 13:25:37 pve kernel: [    0.567594] pci 0000:04:00.0: Adding to iommu group 13
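
For anyone checking a similar setup, here is a minimal sketch in Python that groups devices by IOMMU group from boot log lines like the ones above. The function name parse_iommu_groups and the SAMPLE excerpt are illustrative only and not part of any Proxmox tooling; the regular expression assumes the "Adding to iommu group" message format shown in this post.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   pci 0000:01:00.0: Adding to iommu group 1
# (format assumed from the log excerpt in this post)
GROUP_RE = re.compile(r"pci (\S+): Adding to iommu group (\d+)")


def parse_iommu_groups(log_text):
    """Return {group_number: [pci_address, ...]} for every matching line."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = GROUP_RE.search(line)
        if match:
            groups[int(match.group(2))].append(match.group(1))
    return dict(groups)


# Four lines copied from the log above, used as sample input.
SAMPLE = """\
Jul 19 13:25:37 pve kernel: [    0.567380] pci 0000:00:01.0: Adding to iommu group 1
Jul 19 13:25:37 pve kernel: [    0.567389] pci 0000:00:02.0: Adding to iommu group 2
Jul 19 13:25:37 pve kernel: [    0.567551] pci 0000:01:00.0: Adding to iommu group 1
Jul 19 13:25:37 pve kernel: [    0.567556] pci 0000:01:00.1: Adding to iommu group 1
"""

if __name__ == "__main__":
    for group, devices in sorted(parse_iommu_groups(SAMPLE).items()):
        print(group, devices)
```

On the log above, this would list 0000:00:01.0, 0000:01:00.0 and 0000:01:00.1 together in group 1, and 0000:00:02.0 alone in group 2.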

The installer works as expected. What doesn't work is nvidia-smi: it either reports that no device was found or throws some other error. The kernel log then shows this:

Code:
Jul 19 13:04:23 pve kernel: [    5.414677] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.05  Fri Jul 14 20:20:58 UTC 2023
Jul 19 13:04:23 pve kernel: [    5.431577] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Jul 19 13:04:23 pve kernel: [    5.431580] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Jul 19 13:04:23 pve kernel: [    5.463583] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
Jul 19 13:04:23 pve kernel: [    5.468324] nvidia-uvm: Loaded the UVM driver, major device number 507.
Jul 19 13:04:23 pve kernel: [    5.472258] Loading iSCSI transport class v2.0-870.
Jul 19 13:04:23 pve kernel: [    5.475025] iscsi: registered transport (tcp)
Jul 19 13:04:23 pve kernel: [    5.619771] pcieport 0000:00:01.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
Jul 19 13:04:23 pve kernel: [    5.667909] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
Jul 19 13:04:23 pve kernel: [    5.667932] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
Jul 19 13:04:23 pve kernel: [    5.667963] pcieport 0000:00:01.0:    [14] CmpltTO                (First)
Jul 19 13:04:23 pve kernel: [    5.667977] nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
Jul 19 13:04:23 pve kernel: [    5.667978] snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
Jul 19 13:04:23 pve kernel: [    5.667985] NVRM: GPU at PCI:0000:01:00: GPU-a23264ce-e466-644b-6346-43552a404d5c
Jul 19 13:04:23 pve kernel: [    5.667987] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jul 19 13:04:23 pve kernel: [    5.667989] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Jul 19 13:04:23 pve kernel: [    5.668180] NVRM: A GPU crash dump has been created. If possible, please run
Jul 19 13:04:23 pve kernel: [    5.668180] NVRM: nvidia-bug-report.sh as root to collect this data before
Jul 19 13:04:23 pve kernel: [    5.668180] NVRM: the NVIDIA kernel module is unloaded.
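
When comparing driver versions, it can help to pull just the failure lines out of a long kernel log. The following is a rough Python sketch that extracts NVRM Xid events and PCIe AER bus errors; summarize, XID_RE, AER_RE and SAMPLE are hypothetical names, and the patterns are assumptions based only on the message formats visible in the log above.

```python
import re

# Patterns assumed from the log lines in this post, e.g.:
#   NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', ...
#   pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")
AER_RE = re.compile(
    r"pcieport (\S+): PCIe Bus Error: severity=([^,]+), type=([^,]+),"
)


def summarize(log_text):
    """Return (xid_events, aer_events) found in the given log text.

    xid_events: list of (pci_address, xid_code)
    aer_events: list of (port_address, severity, error_type)
    """
    xids, aer = [], []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            xids.append((match.group(1), int(match.group(2))))
        match = AER_RE.search(line)
        if match:
            aer.append((match.group(1), match.group(2), match.group(3)))
    return xids, aer


# Two lines copied from the log above, used as sample input.
SAMPLE = """\
Jul 19 13:04:23 pve kernel: [    5.667909] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
Jul 19 13:04:23 pve kernel: [    5.667987] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
"""

if __name__ == "__main__":
    xid_events, aer_events = summarize(SAMPLE)
    print("Xid events:", xid_events)
    print("AER events:", aer_events)
```

Running the same function over the logs from a working driver (470.199.02) and a failing one (535.86.05) would show at a glance whether the Xid 79 and the fatal AER error on the root port 0000:00:01.0 only appear with the newer driver.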

What does work is driver 470.199.02, and it works flawlessly: I am able to reboot, issue commands like nvidia-smi, and run other applications. The issue is that the application I am trying to run won't run correctly with this older driver. It does work correctly with the latest driver, v535.

I have tried installing several drivers newer than 470.199.02, and most of them do not work properly.
I couldn't figure it out. I couldn't get any Nvidia driver newer than 470.199.02 working on the pve host or in LXC. I ended up passing the GPU through to Ubuntu, which allowed me to use the latest Nvidia drivers. Not ideal, but it works for my needs.