HP Z840 with Nvidia Tesla P4 GPU Passthrough Stopped Working

inonzur

New Member
Sep 22, 2023
I've spent the last couple of days trying to sort out an issue with Windows and the Nvidia P4, without any luck. Everything works fine in an Ubuntu Server VM. The P4 even shows up in the Windows Device Manager (without a Code 43 error) and in GPU-Z, but it doesn't show up in Task Manager or work with any other software. I've tried several config variations, a clean install of Proxmox 8, clean installs of Windows 10 and 11, rolling back to kernel 5.15, etc., all with the same result. This makes me think it's something in the BIOS, but everything was working great for several months and then just stopped being recognized earlier this week. Any help would be very much appreciated!

/etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt initcall_blacklist=sysfb_init
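
Since this host boots through proxmox-boot-tool (see the status output further down), edits to /etc/kernel/cmdline only reach the kernel after `proxmox-boot-tool refresh` and a reboot. A minimal sketch for confirming that every configured parameter is present on the running kernel's command line; the function name `missing_cmdline_params` and its two optional path arguments are made up for illustration:

```shell
#!/bin/sh
# Print each parameter from the configured cmdline file that is absent
# from the running kernel's command line. Both paths can be overridden
# (hypothetical arguments, handy for trying the function on copies).
missing_cmdline_params() {
    configured="${1:-/etc/kernel/cmdline}"
    running="${2:-/proc/cmdline}"
    # Pad with spaces so that only whole parameters match.
    running_line=" $(tr '\n' ' ' < "$running") "
    for param in $(cat "$configured"); do
        case "$running_line" in
            *" $param "*) ;;                    # present, nothing to report
            *) printf '%s\n' "$param" ;;        # configured but not active
        esac
    done
}
```

Calling `missing_cmdline_params` with no arguments on the host should print nothing if the boot entries are in sync with the file.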

/etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0

/etc/modprobe.d/nvidia.conf
Code:
blacklist nvidiafb
blacklist nouveau
blacklist nvidia
blacklist nvidia_drm

/etc/modules
Code:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
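
A side note on this list: the lsmod output later in the thread shows no vfio_virqfd. As far as I know, that functionality was folded into the vfio module from kernel 6.2 onward, so the entry should be unnecessary on the 6.2 kernels listed by proxmox-boot-tool. A small sketch for comparing /etc/modules against what is actually loaded; `unloaded_modules` and its optional path arguments are hypothetical names used only here:

```shell
#!/bin/sh
# List modules named in a modules file that do not appear in the loaded
# module list. Paths default to the real files; both arguments are
# hypothetical conveniences so the function can be tried on copies.
unloaded_modules() {
    wanted="${1:-/etc/modules}"
    loaded="${2:-/proc/modules}"
    # Strip comments, turn '-' into '_' (the loaded list uses
    # underscores), skip blank lines and keep the first word.
    sed -e 's/#.*//' -e 'y/-/_/' "$wanted" | awk 'NF { print $1 }' |
    while read -r mod; do
        # Code built into the kernel never shows up in /proc/modules, so a
        # name printed here means "not loaded as a module", which is not
        # necessarily an error.
        grep -q "^$mod " "$loaded" || printf '%s\n' "$mod"
    done
}
```

Run as `unloaded_modules` on the host; any name it prints is worth a second look, but is not proof of a problem.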

lspci -nv -s 84:00.0
Code:
84:00.0 0302: 10de:1bb3 (rev a1)
    Subsystem: 10de:11d8
    Physical Slot: 4
    Flags: fast devsel, IRQ 255, NUMA node 1, IOMMU group 7
    Memory at f8000000 (32-bit, non-prefetchable) [disabled] [size=16M]
    Memory at 3800e0000000 (64-bit, prefetchable) [disabled] [size=256M]
    Memory at 3800f0000000 (64-bit, prefetchable) [disabled] [size=32M]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [420] Advanced Error Reporting
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] Secondary PCI Express
    Kernel modules: nvidiafb, nouveau
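
The output above lists candidate kernel modules but has no "Kernel driver in use:" line, which suggests no driver was bound when it was captured. That is what I would expect while the VM is stopped, since Proxmox normally binds the device to vfio-pci when the VM starts. A minimal sketch for checking the binding through sysfs; `bound_driver` and its optional sysfs-root argument are illustrative names, not an existing tool:

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# The second argument (sysfs root) is a hypothetical override for testing.
bound_driver() {
    bdf="$1"
    sysfs="${2:-/sys}"
    link="$sysfs/bus/pci/devices/$bdf/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

For the card in this thread the call would be `bound_driver 0000:84:00.0`, ideally run once with the VM stopped and once with it running.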

lspci -nn
Code:
00:00.0 Host bridge [0600]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2 [8086:6f00] (rev 01)
00:01.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 [8086:6f02] (rev 01)
00:01.1 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 [8086:6f03] (rev 01)
00:02.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 2 [8086:6f04] (rev 01)
00:03.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 [8086:6f08] (rev 01)
00:05.0 System peripheral [0880]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management [8086:6f28] (rev 01)
00:05.1 System peripheral [0880]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug [8086:6f29] (rev 01)
00:05.2 System peripheral [0880]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors [8086:6f2a] (rev 01)
00:05.4 PIC [0800]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC [8086:6f2c] (rev 01)
00:11.0 Unassigned class [ff00]: Intel Corporation C610/X99 series chipset SPSR [8086:8d7c] (rev 05)
00:11.4 RAID bus controller [0104]: Intel Corporation C610/X99 series chipset sSATA Controller [RAID mode] [8086:2827] (rev 05)
00:14.0 USB controller [0c03]: Intel Corporation C610/X99 series chipset USB xHCI Host Controller [8086:8d31] (rev 05)
00:16.0 Communication controller [0780]: Intel Corporation C610/X99 series chipset MEI Controller #1 [8086:8d3a] (rev 05)
00:16.3 Serial controller [0700]: Intel Corporation C610/X99 series chipset KT Controller [8086:8d3d] (rev 05)
00:19.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I218-LM [8086:15a0] (rev 05)
00:1a.0 USB controller [0c03]: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2 [8086:8d2d] (rev 05)
00:1b.0 Audio device [0403]: Intel Corporation C610/X99 series chipset HD Audio Controller [8086:8d20] (rev 05)
00:1c.0 PCI bridge [0604]: Intel Corporation C610/X99 series chipset PCI Express Root Port #1 [8086:8d10] (rev d5)
00:1c.3 PCI bridge [0604]: Intel Corporation C610/X99 series chipset PCI Express Root Port #4 [8086:8d16] (rev d5)
00:1c.4 PCI bridge [0604]: Intel Corporation C610/X99 series chipset PCI Express Root Port #5 [8086:8d18] (rev d5)
00:1d.0 USB controller [0c03]: Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #1 [8086:8d26] (rev 05)
00:1f.0 ISA bridge [0601]: Intel Corporation C610/X99 series chipset LPC Controller [8086:8d44] (rev 05)
00:1f.2 RAID bus controller [0104]: Intel Corporation C600/X79 series chipset SATA RAID Controller [8086:2826] (rev 05)
00:1f.3 SMBus [0c05]: Intel Corporation C610/X99 series chipset SMBus Controller [8086:8d22] (rev 05)
01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0087] (rev 05)
02:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X550 [8086:1563] (rev 01)
02:00.1 Ethernet controller [0200]: Intel Corporation Ethernet Controller X550 [8086:1563] (rev 01)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050] [10de:1c81] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
06:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963 [144d:a804]
07:00.0 Ethernet controller [0200]: Intel Corporation I210 Gigabit Network Connection [8086:1533] (rev 03)
80:00.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 0 [8086:6f01] (rev 01)
80:01.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 [8086:6f02] (rev 01)
80:01.1 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 [8086:6f03] (rev 01)
80:02.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 2 [8086:6f04] (rev 01)
80:03.0 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 [8086:6f08] (rev 01)
80:03.2 PCI bridge [0604]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 [8086:6f0a] (rev 01)
80:05.0 System peripheral [0880]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management [8086:6f28] (rev 01)
80:05.1 System peripheral [0880]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug [8086:6f29] (rev 01)
80:05.2 System peripheral [0880]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors [8086:6f2a] (rev 01)
80:05.4 PIC [0800]: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC [8086:6f2c] (rev 01)
84:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
85:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
85:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
85:00.2 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
85:00.3 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
86:00.0 USB controller [0c03]: Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100] (rev 10)

proxmox-boot-tool status
Code:
Re-executing '/usr/sbin/proxmox-boot-tool' in new private mount namespace..
System currently booted with uefi
103C-C426 is configured with: uefi (versions: 5.15.108-1-pve, 6.2.16-12-pve, 6.2.16-3-pve)

dmesg | grep -e DMAR -e IOMMU
Code:
[    0.008546] ACPI: DMAR 0x00000000DBF07000 000148 (v01 HPQOEM SLIC-WKS 00000001 INTL 20091013)
[    0.008573] ACPI: Reserving DMAR table memory at [mem 0xdbf07000-0xdbf07147]
[    0.327326] DMAR: IOMMU enabled
[    0.740510] DMAR: Host address width 46
[    0.740512] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[    0.740519] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.740523] DMAR: DRHD base: 0x000000f7ffd000 flags: 0x0
[    0.740528] DMAR: dmar1: reg_base_addr f7ffd000 ver 1:0 cap 8d2008c10ef0466 ecap f0205b
[    0.740532] DMAR: DRHD base: 0x000000f7ffc000 flags: 0x1
[    0.740541] DMAR: dmar2: reg_base_addr f7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.740545] DMAR: RMRR base: 0x000000daf54000 end: 0x000000daf56fff
[    0.740548] DMAR: ATSR flags: 0x0
[    0.740550] DMAR: ATSR flags: 0x0
[    0.740553] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[    0.740557] DMAR-IR: IOAPIC id 8 under DRHD base  0xf7ffc000 IOMMU 2
[    0.740559] DMAR-IR: IOAPIC id 9 under DRHD base  0xf7ffc000 IOMMU 2
[    0.740562] DMAR-IR: HPET id 0 under DRHD base 0xf7ffc000
[    0.740564] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.741565] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    2.756160] DMAR: No SATC found
[    2.756165] DMAR: IOMMU feature sc_support inconsistent
[    2.756167] DMAR: IOMMU feature dev_iotlb_support inconsistent
[    2.756169] DMAR: IOMMU feature sc_support inconsistent
[    2.756172] DMAR: IOMMU feature dev_iotlb_support inconsistent
[    2.756174] DMAR: dmar1: Using Queued invalidation
[    2.756182] DMAR: dmar0: Using Queued invalidation
[    2.756187] DMAR: dmar2: Using Queued invalidation
[    2.768490] DMAR: Intel(R) Virtualization Technology for Directed I/O

/etc/pve/qemu-server/911.conf
Code:
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 28
cpu: host,hidden=1,flags=+pcid
efidisk0: zfs-nvme-1tb:vm-911-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:84:00,pcie=1,x-vga=1
machine: pc-q35-8.0
memory: 16384
meta: creation-qemu=7.0.0,ctime=1666366731
name: w11-p4-lab
net0: virtio=06:94:01:41:E2:AF,bridge=vmbr0,firewall=1
net1: virtio=42:AD:05:DA:36:42,bridge=vmbr20,firewall=1
net2: virtio=76:F4:5F:07:5D:A5,bridge=vmbr50,firewall=1
numa: 0
ostype: win11
scsihw: virtio-scsi-single
smbios1: uuid=6c26a6f8-88b2-4dcf-b0b8-19199bb37693
sockets: 1
tablet: 1
tags: lab
tpmstate0: zfs-nvme-1tb:vm-911-disk-2,size=4M,version=v2.0
virtio0: zfs-nvme-1tb:vm-911-disk-1,discard=on,iothread=1,size=200G
vmgenid: c486c3d3-2420-4d31-a445-5651d063a25e

 
So from what I'm seeing (correct me if I'm wrong, I started with Proxmox a week ago), the GPU is being passed through to the Windows VM to some degree, but the VM is not able to fully use it. How about in Ubuntu? Have you tried that? Also, what other devices are in the P4's IOMMU group besides the GPU itself (audio, USB-C hub, etc.)?

also what kernel modules are loaded by pve?
 
GPU is being passed through to the Windows VM to some degree, but the VM is not able to fully use it
You're correct, the Windows VM is not able to fully use it.

How about in Ubuntu?
I'm able to use it without issue in an Ubuntu VM, for example for Jellyfin transcoding.

Also, what other devices are in the P4's IOMMU group besides the GPU itself (audio, USB-C hub, etc.)?
AFAIK there aren't any other devices in its IOMMU group. This is an enterprise card with no video output.
Code:
IOMMU Group 7:
    84:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
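
For anyone wanting to reproduce a listing like the one above, a minimal sketch that walks the IOMMU groups in sysfs. It prints PCI addresses only (no device names), and the function name `list_iommu_groups` plus its optional root argument are made up for this example:

```shell
#!/bin/sh
# Print one line per device, prefixed with its IOMMU group number.
# The optional argument overrides the sysfs directory (hypothetical
# parameter, useful for trying the function on a fake tree).
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded glob when no groups exist (IOMMU disabled).
        [ -e "$dev" ] || [ -L "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<address>"
        group="${group##*/}"        # keep only the group number
        printf 'IOMMU Group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

Piping the result through `grep 84:00` narrows it to the P4; a group containing only that one address matches what is reported above.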

also what kernel modules are loaded by pve?
Code:
root@hp1:~# lsmod | grep -e vfio
vfio_pci               16384  1
vfio_pci_core          94208  1 vfio_pci
irqbypass              16384  58 vfio_pci_core,kvm
vfio_iommu_type1       49152  1
vfio                   57344  7 vfio_pci_core,vfio_iommu_type1,vfio_pci
iommufd                73728  1 vfio
 
I have the same issue with a Tesla P4, but on an AMD platform (Ryzen 7 5700G on an ASRock X470D4U2-2T). Everything had been working fine for years, then I upgraded PVE to v8. At some point after that I started seeing the same behavior on a Windows 10 VM: the Tesla P4 shows up in Device Manager with drivers properly loaded, and GPU-Z reports it fine. The GPU-Z sensors show GPU clock, memory clock, temperatures, power draw and consumption, and voltages, but zero load.

My first indication of a problem was codeproject.ai no longer using GPU/CUDA for object detection and Plex no longer using hardware transcoding.

Pulling out what little hair I have left trying to figure this out...

Edit/update/correction: it looks like object detection in codeproject.ai using CUDA does actually still work, getting <45 ms detections using YOLOv5 6.2. So it seems to be just video encode/decode and 3D acceleration that aren't working. Passthrough itself is still working, just not all GPU features. Makes no sense.
 
I finally got it to work with the Nvidia GRID drivers. I think these drivers are behind Nvidia's enterprise login so I don't think you can download without an account.

I snagged mine from here:
https://github.com/justin-himself/NVIDIA-VGPU-Driver-Archive/releases/tag/16.1
 
So that makes me wonder if this isn't some Nvidia driver tomfoolery going on... again. It was working fine with the 518 driver. Then object detection stopped working in CUDA, so I updated the driver to 537 and object detection was working again. The whole time the drivers looked fine, so it didn't even occur to me that video encode/decode and 3D acceleration weren't working; I didn't discover that until I noticed high CPU usage during a Plex transcode.

I have an Nvidia enterprise login, so I might try the GRID drivers, though without a GRID license it's technically against the EULA. I was thinking about moving Plex to an Intel-based physical machine anyway.
 
Confirmed: installing the GRID driver package resolved the problem, with no other changes made.

I think that pretty much confirms that Nvidia is now limiting full passthrough of all features to the enterprise GRID drivers only.
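
Not something confirmed in this thread, but one possible explanation worth checking: on Windows, Nvidia drivers run a GPU in either WDDM or TCC mode. As far as I know, a card in TCC mode can still run CUDA but is not exposed as a graphics device, which would be consistent with CUDA detection working while the card is missing from Task Manager. nvidia-smi can report the mode through the `driver_model.current` query field. The helper below is only an illustrative sketch (the function name and messages are made up), meant for a POSIX shell inside the guest such as Git Bash:

```shell
#!/bin/sh
# Turn a driver-model value as reported by nvidia-smi into a short hint.
# Function name and wording are illustrative only.
explain_driver_model() {
    case "$1" in
        WDDM) echo "WDDM: graphics stack available, GPU should appear in Task Manager" ;;
        TCC)  echo "TCC: compute only, no Task Manager entry or 3D acceleration" ;;
        *)    echo "unknown driver model: $1" ;;
    esac
}

# Usage inside the Windows guest (tr removes the carriage return that
# Windows tools append to each line):
#   explain_driver_model "$(nvidia-smi --query-gpu=driver_model.current --format=csv,noheader | tr -d '\r')"
```

If the regular driver reports TCC and the GRID driver reports WDDM on the same VM, that would support this reading; if both report WDDM, the cause lies elsewhere.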
 
