[HELP] Attempting Quadro K4000 passthrough results in Segfault/QEMU exit code 1 at VM startup

ErinAran

Nov 5, 2024
Hi, I'm currently running into an issue where I can't pass through a known-working GPU (Quadro K4000/GK106GL) to a VM. Whenever I start a VM with this device passed through, QEMU exits with code 1 after about 10-30 seconds. As far as I can tell, the VM never reaches the BIOS.

Here is the output of "journalctl -f" while trying to start a test VM with the video card:
Code:
Nov 04 19:07:33 vault pvedaemon[2729]: start VM 169: UPID:vault:00000AA9:000014D6:67298BF5:qmstart:169:root@pam:
Nov 04 19:07:33 vault pvedaemon[2609]: <root@pam> starting task UPID:vault:00000AA9:000014D6:67298BF5:qmstart:169:root@pam:
Nov 04 19:07:33 vault kernel: vfio-pci 0000:0b:00.0: vgaarb: deactivate vga console
Nov 04 19:07:33 vault kernel: vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Nov 04 19:07:33 vault kernel: vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Nov 04 19:07:33 vault kernel: vfio-pci 0000:0b:00.0: vgaarb: deactivate vga console
Nov 04 19:07:33 vault kernel: vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Nov 04 19:07:33 vault systemd[1]: Created slice qemu.slice - Slice /qemu.
Nov 04 19:07:33 vault systemd[1]: Started 169.scope.
Nov 04 19:07:35 vault kernel: tap169i0: entered promiscuous mode
Nov 04 19:07:35 vault kernel: vmbr0: port 2(fwpr169p0) entered blocking state
Nov 04 19:07:35 vault kernel: vmbr0: port 2(fwpr169p0) entered disabled state
Nov 04 19:07:35 vault kernel: fwpr169p0: entered allmulticast mode
Nov 04 19:07:35 vault kernel: fwpr169p0: entered promiscuous mode
Nov 04 19:07:35 vault kernel: bond0: entered promiscuous mode
Nov 04 19:07:35 vault kernel: ixgbe 0000:0c:00.0 eth11: entered promiscuous mode
Nov 04 19:07:35 vault kernel: ixgbe 0000:0c:00.1 eth12: entered promiscuous mode
Nov 04 19:07:35 vault kernel: ixgbe 0000:0a:00.0 eth13: entered promiscuous mode
Nov 04 19:07:35 vault kernel: ixgbe 0000:0a:00.1 eth14: entered promiscuous mode
Nov 04 19:07:43 vault pvedaemon[2610]: VM 169 qmp command failed - VM 169 qmp command 'query-proxmox-support' failed - got timeout
Nov 04 19:07:49 vault kernel: vmbr0: port 2(fwpr169p0) entered blocking state
Nov 04 19:07:49 vault kernel: vmbr0: port 2(fwpr169p0) entered forwarding state
Nov 04 19:07:49 vault kernel: fwbr169i0: port 1(fwln169i0) entered blocking state
Nov 04 19:07:49 vault kernel: fwbr169i0: port 1(fwln169i0) entered disabled state
Nov 04 19:07:49 vault kernel: fwln169i0: entered allmulticast mode
Nov 04 19:07:49 vault kernel: fwln169i0: entered promiscuous mode
Nov 04 19:07:49 vault kernel: fwbr169i0: port 1(fwln169i0) entered blocking state
Nov 04 19:07:49 vault kernel: fwbr169i0: port 1(fwln169i0) entered forwarding state
Nov 04 19:07:49 vault kernel: fwbr169i0: port 2(tap169i0) entered blocking state
Nov 04 19:07:49 vault kernel: fwbr169i0: port 2(tap169i0) entered disabled state
Nov 04 19:07:49 vault kernel: tap169i0: entered allmulticast mode
Nov 04 19:07:49 vault kernel: fwbr169i0: port 2(tap169i0) entered blocking state
Nov 04 19:07:49 vault kernel: fwbr169i0: port 2(tap169i0) entered forwarding state
Nov 04 19:07:51 vault kernel: show_signal_msg: 13 callbacks suppressed
---
Nov 04 19:07:51 vault kernel: kvm[2762]: segfault at b8 ip 0000559534cd1ba5 sp 00007ffc492f8ab0 error 4 in qemu-system-x86_64[55953491f000+625000] likely on CPU 1 (core 1, socket 0)
Nov 04 19:07:51 vault kernel: Code: 48 85 c0 75 f0 48 8b 6b 60 48 89 b3 80 00 00 00 e8 60 6b 00 00 48 8b 7b 40 83 05 e1 49 b3 00 01 48 85 ff 74 05 e8 5b ea 06 00 <48> 8b 85 b8 00 00 00 48 85 c0 74 7f 8b 93 b0 00 00 00 eb 13 0f 1f
---
Nov 04 19:07:51 vault kernel: fwbr169i0: port 2(tap169i0) entered disabled state
Nov 04 19:07:51 vault kernel: tap169i0 (unregistering): left allmulticast mode
Nov 04 19:07:51 vault kernel: fwbr169i0: port 2(tap169i0) entered disabled state
Nov 04 19:07:51 vault pvedaemon[2729]: start failed: QEMU exited with code 1
Nov 04 19:07:51 vault pvedaemon[2609]: <root@pam> end task UPID:vault:00000AA9:000014D6:67298BF5:qmstart:169:root@pam: start failed: QEMU exited with code 1
Nov 04 19:07:51 vault pvestatd[2578]: VM 169 qmp command failed - VM 169 qmp command 'query-proxmox-support' failed - unable to connect to VM 169 qmp socket - Connection refused
Nov 04 19:07:51 vault pvedaemon[2611]: VM 169 qmp command failed - VM 169 qmp command 'query-proxmox-support' failed - unable to connect to VM 169 qmp socket - Connection refused

The "Code:" bytes in the segfault message (the instruction bytes around the faulting instruction, not really an error code) are exactly the same every time:
Code:
48 85 c0 75 f0 48 8b 6b 60 48 89 b3 80 00 00 00 e8 60 6b 00 00 48 8b 7b 40 83 05 e1 49 b3 00 01 48 85 ff 74 05 e8 5b ea 06 00 <48> 8b 85 b8 00 00 00 48 85 c0 74 7f 8b 93 b0 00 00 00 eb 13 0f 1f
The "likely on CPU" value changes from run to run, though.
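
For reference, the offset of the faulting instruction inside the qemu-system-x86_64 mapping can be worked out from that kernel line as `ip` minus the base in `[base+size]`. Below is a minimal sketch of that arithmetic; `fault_offset` is just a name made up for this example, and the sample line is copied from the journal output above. The bytes marked `<48> 8b 85 b8 00 00 00` decode to a 64-bit load from `[rbp+0xb8]`, which would fit "segfault at b8" if rbp was 0 (a NULL pointer dereference), but that is only a reading of the bytes, not a backtrace.

```shell
#!/bin/sh
# Hypothetical helper: read kernel "segfault at ..." lines on stdin and print
# ip minus the mapping base taken from the [base+size] field, in hex.
fault_offset() {
    sed -n 's/.* ip \([0-9a-f]*\) .*\[\([0-9a-f]*\)+[0-9a-f]*\].*/\1 \2/p' |
    while read -r ip base; do
        printf '0x%x\n' "$(( 0x$ip - 0x$base ))"
    done
}

# Sample input: the segfault line from the journal output above.
line='kvm[2762]: segfault at b8 ip 0000559534cd1ba5 sp 00007ffc492f8ab0 error 4 in qemu-system-x86_64[55953491f000+625000] likely on CPU 1 (core 1, socket 0)'
printf '%s\n' "$line" | fault_offset
```

For the line above this prints 0x3b2ba5. Turning that offset into a function name would additionally need QEMU's debug symbols, which are not part of this post.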

My setup is as follows:

Motherboard: ASUS X99-E WS/USB3.1 (The block diagram for my mobo can be found on page 183 of the manual.)
CPU: Xeon E5-1660V3
RAM: 8x16GB DDR4 ECC 2133

PCIE slots:
1: PCIE SSD (Intel p3600 1.4tb)
2: LSI HBA card in IT mode
3: PCIE SSD (Intel p3600 1.4tb)
4: Empty, previously where my GPU was
5: X520 DA2 NIC
6: Quadro K4000
7: X520 DA2 NIC

Slots 1/2/3 are passed through to a TrueNAS SCALE VM with no hiccups. Passing the Quadro through to any VM, with the card in either slot 4 or slot 6, results in this error at startup. None of the PCI passthrough options in the GUI seem to have any bearing on the issue: ROM-Bar, All Functions, PCI-Express with Q35 and OVMF, plain old i440fx/SeaBIOS, x-vga=0 or 1... No dice on anything.

Proxmox itself is running in UEFI mode.

IOMMU is enabled. The GPU and its HDMI audio controller are both in IOMMU group 37. Nothing else is in said group.
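
One way to double-check the group membership is to list the device links that sysfs exposes for that group. A small sketch; `iommu_group_devices` is a made-up helper name, and the optional second argument exists only so it can be pointed at a test directory instead of the real /sys/kernel/iommu_groups.

```shell
#!/bin/sh
# Hypothetical helper: list the PCI addresses in one IOMMU group.
# $1 = group number
# $2 = sysfs root (optional, defaults to the real /sys/kernel/iommu_groups)
iommu_group_devices() {
    root=${2:-/sys/kernel/iommu_groups}
    for dev in "$root/$1"/devices/*; do
        # Skip the unexpanded glob when the group is empty or missing.
        [ -e "$dev" ] || [ -L "$dev" ] || continue
        basename "$dev"
    done
}

# Group 37 is the group number reported above.
iommu_group_devices 37
```

On this host it should list only 0000:0b:00.0 and 0000:0b:00.1 if the group really contains nothing else.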

My cmdline options: quiet intel_iommu=on iommu=pt pcie_aspm=off (ASPM is disabled for now because the LSI card spits out a ton of "recovered error" messages otherwise.) In the BIOS, VT-d, Intel Virtualization Technology, and ACS are all enabled. I've also disabled ASPM everywhere I can in the BIOS, which makes no difference as far as I can tell. I've also blacklisted the nvidia and nouveau drivers in /etc/modprobe.d/blacklist.conf.
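
To rule out the options not reaching the running kernel, the active command line can be compared against the expected parameters. A rough sketch; `check_cmdline` is a hypothetical helper, and the string passed in below is simply the option list quoted above rather than a live read of /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: report which parameters appear in a kernel command line.
# $1 = command line string, remaining arguments = parameters to look for.
check_cmdline() {
    cmdline=$1
    shift
    for want in "$@"; do
        # Pad with spaces so only whole parameters match.
        case " $cmdline " in
            *" $want "*) echo "present: $want" ;;
            *)           echo "MISSING: $want" ;;
        esac
    done
}

# On the host itself this would be:
#   check_cmdline "$(cat /proc/cmdline)" intel_iommu=on iommu=pt pcie_aspm=off
check_cmdline "quiet intel_iommu=on iommu=pt pcie_aspm=off" intel_iommu=on iommu=pt pcie_aspm=off
```

Any "MISSING" line on the real host would mean the bootloader configuration was edited but not applied to the kernel that is actually running.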

I've beaten the passthrough guide to death. Here is the output of the commands it lists:

lspci -nnk (just the relevant GPU stuff):
Code:
0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK106GL [Quadro K4000] [10de:11fa] (rev a1)
        Subsystem: Hewlett-Packard Company GK106GL [Quadro K4000] [103c:079c]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
0b:00.1 Audio device [0403]: NVIDIA Corporation GK106 HDMI Audio Controller [10de:0e0b] (rev a1)
        Subsystem: Hewlett-Packard Company GK106 HDMI Audio Controller [103c:079c]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
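
The vendor:device pairs in square brackets are what a vfio-pci `ids=` option would list. A minimal sketch for extracting them from `lspci -nn` style text; `pci_ids` is a made-up helper, and the input is the output pasted above with the driver lines left out.

```shell
#!/bin/sh
# Hypothetical helper: pull the [vendor:device] ID from each device line of
# `lspci -nn` output on stdin and join the IDs with commas.
# Indented lines (Subsystem, Kernel driver, ...) are ignored.
pci_ids() {
    sed -n 's/^[0-9a-f][0-9a-f:.]* .*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p' |
    paste -sd, -
}

pci_ids <<'EOF'
0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK106GL [Quadro K4000] [10de:11fa] (rev a1)
        Subsystem: Hewlett-Packard Company GK106GL [Quadro K4000] [103c:079c]
0b:00.1 Audio device [0403]: NVIDIA Corporation GK106 HDMI Audio Controller [10de:0e0b] (rev a1)
        Subsystem: Hewlett-Packard Company GK106 HDMI Audio Controller [103c:079c]
EOF
```

This prints 10de:11fa,10de:0e0b. Since the driver in use is already vfio-pci for both functions, the binding itself appears to be fine here.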

dmesg | grep -e DMAR -e IOMMU:
Code:
[    0.021251] ACPI: DMAR 0x00000000BB1C0270 0000E4 (v01 ALASKA A M I    00000001 INTL 20091013)
[    0.021292] ACPI: Reserving DMAR table memory at [mem 0xbb1c0270-0xbb1c0353]
[    0.385224] DMAR: IOMMU enabled
[    1.066154] DMAR: Host address width 46
[    1.066156] DMAR: DRHD base: 0x000000fbffd000 flags: 0x0
[    1.066168] DMAR: dmar0: reg_base_addr fbffd000 ver 1:0 cap d2008c10ef0466 ecap f0205b
[    1.066173] DMAR: DRHD base: 0x000000fbffc000 flags: 0x1
[    1.066181] DMAR: dmar1: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[    1.066185] DMAR: RMRR base: 0x000000bdb73000 end: 0x000000bdb81fff
[    1.066189] DMAR: ATSR flags: 0x0
[    1.066192] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x0
[    1.066196] DMAR-IR: IOAPIC id 1 under DRHD base  0xfbffc000 IOMMU 1
[    1.066200] DMAR-IR: IOAPIC id 2 under DRHD base  0xfbffc000 IOMMU 1
[    1.066202] DMAR-IR: HPET id 0 under DRHD base 0xfbffc000
[    1.066205] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[    1.066207] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[    1.067145] DMAR-IR: Enabled IRQ remapping in xapic mode
[    3.761644] DMAR: [Firmware Bug]: RMRR entry for device 13:00.0 is broken - applying workaround
[    3.761651] DMAR: No SATC found
[    3.761654] DMAR: IOMMU feature sc_support inconsistent
[    3.761656] DMAR: IOMMU feature dev_iotlb_support inconsistent
[    3.761658] DMAR: dmar0: Using Queued invalidation
[    3.761672] DMAR: dmar1: Using Queued invalidation
[    3.782899] DMAR: Intel(R) Virtualization Technology for Directed I/O

dmesg | grep 'remapping':
Code:
[    1.067145] DMAR-IR: Enabled IRQ remapping in xapic mode
[    1.067148] x2apic: IRQ remapping doesn't support X2APIC mode
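
The two lines that matter in the output above are "DMAR: IOMMU enabled" and "DMAR-IR: Enabled IRQ remapping". A small sketch that checks dmesg-style text for both; `iommu_ready` is a hypothetical helper name, and the sample input is taken from the output above.

```shell
#!/bin/sh
# Hypothetical helper: read dmesg-style text on stdin and succeed only if both
# the IOMMU and interrupt remapping are reported as enabled.
iommu_ready() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'DMAR: IOMMU enabled' &&
    printf '%s\n' "$log" | grep -q 'DMAR-IR: Enabled IRQ remapping'
}

# Sample input copied from the dmesg output above.
# On the host itself this would be: dmesg | iommu_ready
if iommu_ready <<'EOF'
[    0.385224] DMAR: IOMMU enabled
[    1.067145] DMAR-IR: Enabled IRQ remapping in xapic mode
EOF
then
    echo "IOMMU and IRQ remapping are reported as enabled"
else
    echo "IOMMU or IRQ remapping line not found"
fi
```

Both lines are present on this host, so the kernel side of the IOMMU setup at least reports itself as working.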

I've tried "intremap=no_x2apic_optout" on the kernel command line. It does seem to enable x2apic: rerunning these commands after a reboot confirms it, but the main problem persists. I've since rolled it back.

I've tried changing the CPU type to host, qemu64, x86-64-vxxxx, and Haswell. No luck.

Enabling Above 4G Decoding in the BIOS causes the web panel to never load. I have no clue what's going on there.

MCTP in the BIOS does not stay enabled after a reboot. Not sure if it's related, but I tried it since it sits alongside ACS under the VT-d section.

All I can think of is that some virtualization/IOMMU-related option in the BIOS isn't actually functioning despite showing as enabled. Does anyone have thoughts on what other troubleshooting steps I can take?
 