ASRock Rack X570D4U-2L2T: GeForce RTX 3060 GPU passthrough issues

svheel

Member
May 2, 2021
Hi everybody,

I've been trying to get GPU passthrough working on the latest Proxmox VE version (8.4.5 right now) with the latest 6.14 kernel (6.14.8-1~bpo12+1), but QEMU segfaults every time I start a VM with the GPU passed through.

These are the relevant specs of my system:
ASRock Rack X570D4U-2L2T motherboard
AMD Ryzen 9 5950X
64 GB RAM
Asus Phoenix GeForce RTX 3060 12 GB

I followed all the standard advice for getting GPU passthrough working. Everything seems OK until I start a VM with the GPU passed through.

IOMMU is enabled and working:

dmesg | grep -e DMAR -e IOMMU
[ 0.668197] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.674759] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).

The command:
pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""

shows the GPU devices (video and audio) in their own IOMMU group. The output is a large table, so I have not posted it here.
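For anyone who wants to cross-check the grouping without the pvesh table, the same information can be read straight from sysfs. This is a minimal sketch, assuming the usual /sys/kernel/iommu_groups layout; the function name list_iommu_groups and its optional root-directory argument are made up for illustration (the argument only exists so the function can be pointed at a test tree).

```shell
#!/usr/bin/env bash
# List every PCI device together with its IOMMU group by walking sysfs.
# list_iommu_groups is a hypothetical helper name, not a Proxmox tool.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local dev group
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded glob when no groups exist (IOMMU disabled).
        [ -e "$dev" ] || continue
        group="${dev#"$root"/}"   # strip the root prefix
        group="${group%%/*}"      # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}

list_iommu_groups
```

If 0000:2d:00.0 and 0000:2d:00.1 are the only entries printed for their group number, the card is isolated the way passthrough expects.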

I blacklisted the NVIDIA drivers and enabled the vfio modules. That seems to have worked, since lspci -nnk shows both functions bound to vfio-pci:

2d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] [10de:2504] (rev a1)
Subsystem: ASUSTeK Computer Inc. GA106 [GeForce RTX 3060 Lite Hash Rate] [1043:8810]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
2d:00.1 Audio device [0403]: NVIDIA Corporation GA106 High Definition Audio Controller [10de:228e] (rev a1)
Subsystem: ASUSTeK Computer Inc. GA106 High Definition Audio Controller [1043:8810]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
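The "Kernel driver in use" lines can also be confirmed from sysfs, which is handy in a script. A minimal sketch, assuming the standard /sys/bus/pci/devices layout, where each bound device has a "driver" symlink; bound_driver and its optional second argument are illustrative names, not an existing tool.

```shell
#!/usr/bin/env bash
# Print the kernel driver bound to a PCI function, taken from its sysfs
# "driver" symlink. Prints "none" when no driver is bound.
# bound_driver is a hypothetical helper; the second argument only exists
# so the function can be pointed at a test tree.
bound_driver() {
    local addr="$1"
    local root="${2:-/sys/bus/pci/devices}"
    local link="$root/$addr/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# The two functions of the GPU from the lspci output above.
for fn in 0000:2d:00.0 0000:2d:00.1; do
    printf '%s -> %s\n' "$fn" "$(bound_driver "$fn")"
done
```

On a correctly prepared host both lines should end in vfio-pci, matching the lspci output.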

The VM config looks good I think:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 4
cpu: host
efidisk0: local-lvm:vm-101-disk-0,efitype=4m,pre-enrolled-keys=1,size=4M
hostpci0: 0000:2d:00,pcie=1
ide2: local:iso/Fedora-Server-netinst-x86_64-42-1.1.iso,media=cdrom,size=943370K
machine: q35
memory: 4096
meta: creation-qemu=9.2.0,ctime=1752926522
name: ai
net0: virtio=BC:24:11:C5:CC:56,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-101-disk-1,discard=on,iothread=1,size=128G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=1c103652-aaa9-4724-b46d-2307e239337a
sockets: 1
tpmstate0: local-lvm:vm-101-disk-2,size=4M,version=v2.0
vga: none
vmgenid: 82ba3e2f-63eb-49b6-8e8d-0c4f7517e905

And yet, when I start the VM I immediately get a QEMU error in the GUI:

TASK ERROR: start failed: QEMU exited with code 1

And journalctl shows this:

Jul 19 14:26:42 pve1 pvedaemon[23026]: start VM 101: UPID:pve1:000059F2:00043ABC:687B8F02:qmstart:101:root@pam:
Jul 19 14:26:42 pve1 pvedaemon[2178]: <root@pam> starting task UPID:pve1:000059F2:00043ABC:687B8F02:qmstart:101:root@pam:
Jul 19 14:26:42 pve1 kernel: vfio-pci 0000:2d:00.0: resetting
Jul 19 14:26:42 pve1 kernel: vfio-pci 0000:2d:00.0: reset done
Jul 19 14:26:43 pve1 systemd[1]: Started 101.scope.
Jul 19 14:26:43 pve1 audit[23043]: AVC apparmor="DENIED" operation="capable" class="cap" profile="swtpm" pid=23043 comm="swtpm" capability=21 capname="sys_admin"
Jul 19 14:26:43 pve1 kernel: audit: type=1400 audit(1752928003.044:35): apparmor="DENIED" operation="capable" class="cap" profile="swtpm" pid=23043 comm="swtpm" capability=21 capname="sys_admin"
Jul 19 14:26:43 pve1 kernel: tap101i0: entered promiscuous mode
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Jul 19 14:26:43 pve1 kernel: vmbr0: port 5(fwpr101p0) entered disabled state
Jul 19 14:26:43 pve1 kernel: fwln101i0 (unregistering): left allmulticast mode
Jul 19 14:26:43 pve1 kernel: fwln101i0 (unregistering): left promiscuous mode
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Jul 19 14:26:43 pve1 kernel: fwpr101p0 (unregistering): left allmulticast mode
Jul 19 14:26:43 pve1 kernel: fwpr101p0 (unregistering): left promiscuous mode
Jul 19 14:26:43 pve1 kernel: vmbr0: port 5(fwpr101p0) entered disabled state
Jul 19 14:26:43 pve1 kernel: vmbr0: port 5(fwpr101p0) entered blocking state
Jul 19 14:26:43 pve1 kernel: vmbr0: port 5(fwpr101p0) entered disabled state
Jul 19 14:26:43 pve1 kernel: fwpr101p0: entered allmulticast mode
Jul 19 14:26:43 pve1 kernel: fwpr101p0: entered promiscuous mode
Jul 19 14:26:43 pve1 kernel: vmbr0: port 5(fwpr101p0) entered blocking state
Jul 19 14:26:43 pve1 kernel: vmbr0: port 5(fwpr101p0) entered forwarding state
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 1(fwln101i0) entered disabled state
Jul 19 14:26:43 pve1 kernel: fwln101i0: entered allmulticast mode
Jul 19 14:26:43 pve1 kernel: fwln101i0: entered promiscuous mode
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 1(fwln101i0) entered blocking state
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 1(fwln101i0) entered forwarding state
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Jul 19 14:26:43 pve1 kernel: tap101i0: entered allmulticast mode
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 2(tap101i0) entered blocking state
Jul 19 14:26:43 pve1 kernel: fwbr101i0: port 2(tap101i0) entered forwarding state
Jul 19 14:26:43 pve1 kernel: vfio-pci 0000:2d:00.0: resetting
Jul 19 14:26:44 pve1 kernel: vfio-pci 0000:2d:00.0: reset done
Jul 19 14:26:44 pve1 kernel: kvm[23048]: segfault at b8 ip 000064aebf9fd9e5 sp 00007ffedd8de640 error 4 in qemu-system-x86_64[7659e5,64aebf5cd000+6ba000] likely on CPU 18 (core 2, socket 0)
Jul 19 14:26:44 pve1 kernel: Code: 48 85 c0 75 f0 48 8b 6b 60 48 89 b3 80 00 00 00 e8 d0 7f 00 00 48 8b 7b 40 83 05 d1 96 30 01 01 48 85 ff 74 05 e8 1b 6d 07 00 <48> 8b 85 b8 00 00 00 48 85 c0 74 7f 8b 93 b0 00 00 00 eb 13 0f 1f
Jul 19 14:26:44 pve1 kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Jul 19 14:26:44 pve1 kernel: tap101i0 (unregistering): left allmulticast mode
Jul 19 14:26:44 pve1 kernel: fwbr101i0: port 2(tap101i0) entered disabled state
Jul 19 14:26:44 pve1 pvedaemon[22245]: VM 101 qmp command failed - VM 101 not running
Jul 19 14:26:44 pve1 pvedaemon[23034]: stopping swtpm instance (pid 23043) due to QEMU startup error
Jul 19 14:26:44 pve1 pvedaemon[23026]: start failed: QEMU exited with code 1
Jul 19 14:26:44 pve1 pvedaemon[2178]: <root@pam> end task UPID:pve1:000059F2:00043ABC:687B8F02:qmstart:101:root@pam: start failed: QEMU exited with code 1
Jul 19 14:26:44 pve1 systemd[1]: 101.scope: Deactivated successfully.
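The segfault line can be decoded a little further. The faulting bytes marked <48> 8b 85 b8 00 00 00 are a mov 0xb8(%rbp),%rax, and the fault address is b8 with error 4 (a user-mode read of an unmapped page), so this looks like QEMU reading a field at offset 0xb8 through a NULL pointer. The sketch below only does the address arithmetic from that log line; mapping_offset is a made-up helper name, and reading the bracketed 7659e5 as a file offset is my assumption, based on the difference being page aligned.

```shell
#!/usr/bin/env bash
# Subtract the mapping base from the instruction pointer to get the
# offset of the faulting instruction inside the qemu-system-x86_64 mapping.
# mapping_offset is a hypothetical helper name.
mapping_offset() {
    printf '0x%x\n' "$(( $1 - $2 ))"
}

ip=0x64aebf9fd9e5      # "ip" value from the segfault line
base=0x64aebf5cd000    # start of the mapping, from [..., base+size]
in_brackets=0x7659e5   # first value inside the brackets

mapping_offset "$ip" "$base"            # offset into the mapping
mapping_offset "$in_brackets" "$(( ip - base ))"   # gap to the bracketed value
```

The offset is the value to use when looking up the faulting function with the matching QEMU debug symbols, if someone has them at hand.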

I tried different settings in the GUI for the PCI device (Primary GPU on/off, ROM-Bar on/off, PCI-Express on/off), but the error is always the same.

Does anybody have a clue what I might be doing wrong?
Searching this forum and the Internet for segfaults when passing through a GPU on Proxmox hasn't turned up anything helpful.

Thanks in advance!