I have an AMD W7600 GPU and am trying to pass it through to a VM.
I have set up GPU passthrough and, as far as I can tell, it works fine with a Windows 11 VM: that VM has been running without a hitch.
The issue arises when I pass the GPU through to a Debian 13 VM. Initially everything works, but after a couple of hours the VM freezes and Proxmox reports that the VM had an "internal error".
Relevant information:
My CPU is an AMD Threadripper 3970X, which has no integrated graphics. As a result, Proxmox seems to use the GPU during boot:
Code:
[ 0.387475] pci 0000:03:00.0: [1002:7480] type 00 class 0x030000 PCIe Legacy Endpoint
[ 0.387519] pci 0000:03:00.0: BAR 0 [mem 0x7a380000000-0x7a38fffffff 64bit pref]
[ 0.387524] pci 0000:03:00.0: BAR 2 [mem 0x7a390000000-0x7a3901fffff 64bit pref]
[ 0.387527] pci 0000:03:00.0: BAR 4 [io 0x3000-0x30ff]
[ 0.387530] pci 0000:03:00.0: BAR 5 [mem 0xe6000000-0xe60fffff]
[ 0.387533] pci 0000:03:00.0: ROM [mem 0xe6100000-0xe611ffff pref]
[ 0.387650] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[ 0.387854] pci 0000:03:00.1: [1002:ab30] type 00 class 0x040300 PCIe Legacy Endpoint
[ 0.387900] pci 0000:03:00.1: BAR 0 [mem 0xe6120000-0xe6123fff]
[ 0.387981] pci 0000:03:00.1: PME# supported from D1 D2 D3hot D3cold
[ 0.440979] pci 0000:03:00.0: vgaarb: setting as boot VGA device <--------------------------------------------------------
[ 0.440979] pci 0000:03:00.0: vgaarb: bridge control possible
[ 0.440979] pci 0000:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[ 0.480902] pci 0000:03:00.1: D0 power state depends on 0000:03:00.0
[ 0.519672] pci 0000:03:00.0: Adding to iommu group 73
[ 0.519726] pci 0000:03:00.1: Adding to iommu group 74
[ 4.884112] vfio-pci 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
However, when I run lspci after boot, the driver in use for both functions is vfio-pci. Combined with the Windows VM working fine, this makes me think that Proxmox using the GPU at startup is not the issue.
Code:
root@services:~# lspci -n -s 03:00 -v
03:00.0 0300: 1002:7480 (prog-if 00 [VGA controller])
Subsystem: 1002:0e0d
Flags: fast devsel, IRQ 11, IOMMU group 73
Memory at 7a380000000 (64-bit, prefetchable) [disabled] [size=256M]
Memory at 7a390000000 (64-bit, prefetchable) [disabled] [size=2M]
I/O ports at 3000 [disabled] [size=256]
Memory at e6000000 (32-bit, non-prefetchable) [disabled] [size=1M]
Expansion ROM at e6100000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [64] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [200] Physical Resizable BAR
Capabilities: [240] Power Budgeting <?>
Capabilities: [270] Secondary PCI Express
Capabilities: [2a0] Access Control Services
Capabilities: [2d0] Process Address Space ID (PASID)
Capabilities: [320] Latency Tolerance Reporting
Capabilities: [410] Physical Layer 16.0 GT/s <?>
Capabilities: [450] Lane Margining at the Receiver <?>
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
03:00.1 0403: 1002:ab30
Subsystem: 1002:ab30
Flags: bus master, fast devsel, latency 0, IRQ 10, IOMMU group 74
Memory at e6120000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [64] Express Legacy Endpoint, MSI 00
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [2a0] Access Control Services
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
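In case it helps anyone reproduce the check, the driver binding can also be read straight from sysfs instead of from lspci. This is only a minimal sketch: `bound_driver` and the `SYSFS_ROOT` override are names I made up for illustration (they are not Proxmox or kernel settings), and the only thing it relies on is the standard `driver` symlink under `/sys/bus/pci/devices/`.

```shell
#!/bin/sh
# Print the kernel driver bound to a PCI function by resolving its sysfs
# "driver" symlink. Prints "none" if no driver is bound.
# SYSFS_ROOT is a hypothetical override so the function can be pointed at a
# fake tree; it defaults to the real /sys.
bound_driver() {
    dev="$1"
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

On the host this would be called as `bound_driver 0000:03:00.0` and `bound_driver 0000:03:00.1` for the two functions of the card.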
The two PCI functions above are each in their own IOMMU group, 73 and 74 (is this bad? I do check the "All functions" box).
Code:
pvesh get /nodes/services/hardware/pci --pci-class-blacklist "" | grep 03:00
│ 0x030000 │ 0x7480 │ 0000:03:00.0 │ 73 │ 0x1002 │ Navi 33 [Radeon RX 7700S/7600S] │ │ 0x0e0d │ │ 0x1002 │ Advanced Micro Devices, Inc. [AMD/ATI] │ Advanced Micro Devices, Inc. [AMD/ATI] │
│ 0x040300 │ 0xab30 │ 0000:03:00.1 │ 74 │ 0x1002 │ │ │ 0xab30 │ │ 0x1002 │ Advanced Micro Devices, Inc. [AMD/ATI] │ Advanced Micro Devices, Inc. [AMD/ATI] │
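To double-check that nothing else shares a group with either function, the group membership can be listed from sysfs as well. Again, this is a sketch under assumptions: `iommu_group_members` and `SYSFS_ROOT` are hypothetical names of my own, and it depends only on the `iommu_group/devices` directory that the kernel exposes for each PCI device when the IOMMU is enabled.

```shell
#!/bin/sh
# List every PCI function that shares an IOMMU group with the given device.
# SYSFS_ROOT is a hypothetical override for pointing at a fake tree;
# it defaults to the real /sys.
iommu_group_members() {
    dev="$1"
    group_dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/$dev/iommu_group/devices"
    if [ ! -d "$group_dir" ]; then
        echo "no IOMMU group found for $dev" >&2
        return 1
    fi
    # Each entry in the group's devices/ directory is named after a PCI address.
    for member in "$group_dir"/*; do
        basename "$member"
    done
}
```

Running `iommu_group_members 0000:03:00.0` and `iommu_group_members 0000:03:00.1` on the host should print only the function itself if each group really is isolated.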
The Debian VM recognizes the GPU, and running radeontop shows it being utilized, so the passthrough is at least partially working. When the Debian VM crashes, I see the following error in the Proxmox system logs:
Code:
Nov 23 15:14:05 services QEMU[9823]: error: kvm run failed Bad address
Nov 23 15:14:05 services QEMU[9823]: RAX=ffffffffc0af94f0 RBX=ffff8aa2c5480000 RCX=0000000000000000 RDX=0000000000000000
Nov 23 15:14:05 services QEMU[9823]: RSI=0000000000005482 RDI=ffff8aa2c5480000 RBP=0000000000005482 RSP=ffff9aaf05ebfb30
Nov 23 15:14:05 services QEMU[9823]: R8 =ffff9aaf05ebfcc7 R9 =0000000000000001 R10=000000000000000d R11=000000000000000d
Nov 23 15:14:05 services QEMU[9823]: R12=ffff9aaf00815208 R13=ffff8aa290ea71b0 R14=0000000000000000 R15=ffff8aa290ea6d70
Nov 23 15:14:05 services QEMU[9823]: RIP=ffffffffc087a044 RFL=00000282 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
Nov 23 15:14:05 services QEMU[9823]: ES =0000 0000000000000000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
Nov 23 15:14:05 services QEMU[9823]: SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
Nov 23 15:14:05 services QEMU[9823]: DS =0000 0000000000000000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: FS =0000 00007f74828881c0 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: GS =0000 ffff8aa9dfc00000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: LDT=0000 fffffe5300000000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: TR =0040 fffffe5344d55000 00004087 00008b00 DPL=0 TSS64-busy
Nov 23 15:14:05 services QEMU[9823]: GDT= fffffe5344d53000 0000007f
Nov 23 15:14:05 services QEMU[9823]: IDT= fffffe0000000000 00000fff
Nov 23 15:14:05 services QEMU[9823]: CR0=80050033 CR2=000055c615459000 CR3=0000000103d3e000 CR4=00350ef0
Nov 23 15:14:05 services QEMU[9823]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Nov 23 15:14:05 services QEMU[9823]: DR6=00000000ffff0ff0 DR7=0000000000000400
Nov 23 15:14:05 services QEMU[9823]: EFER=0000000000001d01
Nov 23 15:14:05 services QEMU[9823]: Code=e2 02 75 09 f6 87 38 a2 04 00 10 75 77 4c 03 a3 00 09 00 00 <45> 8b 24 24 eb 12 4c 89 e6 48 8b 87 40 09 00 00 e8 27 cd a8 d8 41 89 c4 66 90 44 89 e0 5b
After force-stopping the VM, I see the following messages:
Code:
Nov 23 15:28:19 services qmeventd[230210]: Starting cleanup for 117
Nov 23 15:28:19 services qmeventd[230210]: Finished cleanup for 117
Nov 23 15:28:19 services kernel: pcieport 0000:02:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
Nov 23 15:28:20 services kernel: pcieport 0000:02:00.0: retraining failed
Nov 23 15:28:20 services kernel: pcieport 0000:02:00.0: Data Link Layer Link Active not set in 1000 msec
Nov 23 15:28:20 services kernel: vfio-pci 0000:03:00.1: Unable to change power state from D0 to D3hot, device inaccessible
Nov 23 15:28:20 services kernel: vfio-pci 0000:03:00.0: Unable to change power state from D0 to D3hot, device inaccessible
Has anyone worked with a W7600 GPU before and had similar issues?
Is this a Proxmox issue, or is the problem with the Debian VM itself?