AMD W7600 Passthrough

correspondent

New Member
Nov 26, 2025
I have an AMD W7600 GPU and am trying to pass it through to a VM.

I have gone through the usual GPU passthrough setup, and as far as I can tell everything works fine when passing the GPU through to a Windows 11 VM - that VM has been running without a hitch.

The issue arises when I pass the GPU through to a Debian 13 VM. Initially everything works just fine, but after a couple of hours the VM freezes and Proxmox reports that the VM had an "internal error".

Relevant information:

My CPU is an AMD Threadripper 3970X, which has no integrated graphics. As such, Proxmox seems to use the GPU when booting up:
Code:
[    0.387475] pci 0000:03:00.0: [1002:7480] type 00 class 0x030000 PCIe Legacy Endpoint
[    0.387519] pci 0000:03:00.0: BAR 0 [mem 0x7a380000000-0x7a38fffffff 64bit pref]
[    0.387524] pci 0000:03:00.0: BAR 2 [mem 0x7a390000000-0x7a3901fffff 64bit pref]
[    0.387527] pci 0000:03:00.0: BAR 4 [io  0x3000-0x30ff]
[    0.387530] pci 0000:03:00.0: BAR 5 [mem 0xe6000000-0xe60fffff]
[    0.387533] pci 0000:03:00.0: ROM [mem 0xe6100000-0xe611ffff pref]
[    0.387650] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[    0.387854] pci 0000:03:00.1: [1002:ab30] type 00 class 0x040300 PCIe Legacy Endpoint
[    0.387900] pci 0000:03:00.1: BAR 0 [mem 0xe6120000-0xe6123fff]
[    0.387981] pci 0000:03:00.1: PME# supported from D1 D2 D3hot D3cold
[    0.440979] pci 0000:03:00.0: vgaarb: setting as boot VGA device   <--------------------------------------------------------
[    0.440979] pci 0000:03:00.0: vgaarb: bridge control possible
[    0.440979] pci 0000:03:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.480902] pci 0000:03:00.1: D0 power state depends on 0000:03:00.0
[    0.519672] pci 0000:03:00.0: Adding to iommu group 73
[    0.519726] pci 0000:03:00.1: Adding to iommu group 74
[    4.884112] vfio-pci 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
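
If the boot-time VGA claim does turn out to matter, my understanding is that the host framebuffer can be kept off the card with a kernel command-line change. This is a sketch I have not tried yet (`initcall_blacklist=sysfb_init` applies to kernel 5.15 and newer; on systemd-boot hosts the file is /etc/kernel/cmdline and the refresh command is `proxmox-boot-tool refresh` instead):

```shell
# Prepend initcall_blacklist=sysfb_init to the default kernel command line
# so the host never sets up a simpledrm/efifb framebuffer on the GPU.
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&initcall_blacklist=sysfb_init /' /etc/default/grub

# Regenerate the GRUB config so the change takes effect on next boot.
update-grub
```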

However, when I run lspci after boot, I see that the driver in use is vfio-pci. This, combined with the Windows VM working fine, makes me think that Proxmox touching the GPU at startup is not the issue.
Code:
root@services:~# lspci -n -s 03:00 -v
03:00.0 0300: 1002:7480 (prog-if 00 [VGA controller])
        Subsystem: 1002:0e0d
        Flags: fast devsel, IRQ 11, IOMMU group 73
        Memory at 7a380000000 (64-bit, prefetchable) [disabled] [size=256M]
        Memory at 7a390000000 (64-bit, prefetchable) [disabled] [size=2M]
        I/O ports at 3000 [disabled] [size=256]
        Memory at e6000000 (32-bit, non-prefetchable) [disabled] [size=1M]
        Expansion ROM at e6100000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] Physical Resizable BAR
        Capabilities: [240] Power Budgeting <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [450] Lane Margining at the Receiver <?>
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu

03:00.1 0403: 1002:ab30
        Subsystem: 1002:ab30
        Flags: bus master, fast devsel, latency 0, IRQ 10, IOMMU group 74
        Memory at e6120000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [2a0] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

Both of the above PCI devices are in their own IOMMU groups (is this bad? I do check the "All Functions" box when adding the device):
Code:
pvesh get /nodes/services/hardware/pci --pci-class-blacklist "" | grep 03:00
│ 0x030000 │ 0x7480 │ 0000:03:00.0 │         73 │ 0x1002 │ Navi 33 [Radeon RX 7700S/7600S]                         │      │ 0x0e0d           │                       │ 0x1002           │ Advanced Micro Devices, Inc. [AMD/ATI] │ Advanced Micro Devices, Inc. [AMD/ATI] │
│ 0x040300 │ 0xab30 │ 0000:03:00.1 │         74 │ 0x1002 │                                                         │      │ 0xab30           │                       │ 0x1002           │ Advanced Micro Devices, Inc. [AMD/ATI] │ Advanced Micro Devices, Inc. [AMD/ATI] │
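
To double-check that nothing else shares those groups, this sysfs loop (group numbers taken from the dmesg output above; adjust if they change between boots) enumerates every device in each group:

```shell
# List all PCI devices that share an IOMMU group with either GPU function.
# Group numbers 73 and 74 come from the "Adding to iommu group" dmesg lines.
for g in /sys/kernel/iommu_groups/73 /sys/kernel/iommu_groups/74; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        # ${d##*/} strips the path, leaving the PCI address for lspci
        lspci -nns "${d##*/}"
    done
done
```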

The Debian VM recognizes the GPU, and running radeontop shows it being utilized, so the passthrough is at least partially working. When the Debian VM crashes, I see the following error in the Proxmox system log:
Code:
Nov 23 15:14:05 services QEMU[9823]: error: kvm run failed Bad address
Nov 23 15:14:05 services QEMU[9823]: RAX=ffffffffc0af94f0 RBX=ffff8aa2c5480000 RCX=0000000000000000 RDX=0000000000000000
Nov 23 15:14:05 services QEMU[9823]: RSI=0000000000005482 RDI=ffff8aa2c5480000 RBP=0000000000005482 RSP=ffff9aaf05ebfb30
Nov 23 15:14:05 services QEMU[9823]: R8 =ffff9aaf05ebfcc7 R9 =0000000000000001 R10=000000000000000d R11=000000000000000d
Nov 23 15:14:05 services QEMU[9823]: R12=ffff9aaf00815208 R13=ffff8aa290ea71b0 R14=0000000000000000 R15=ffff8aa290ea6d70
Nov 23 15:14:05 services QEMU[9823]: RIP=ffffffffc087a044 RFL=00000282 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
Nov 23 15:14:05 services QEMU[9823]: ES =0000 0000000000000000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
Nov 23 15:14:05 services QEMU[9823]: SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
Nov 23 15:14:05 services QEMU[9823]: DS =0000 0000000000000000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: FS =0000 00007f74828881c0 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: GS =0000 ffff8aa9dfc00000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: LDT=0000 fffffe5300000000 00000000 00000000
Nov 23 15:14:05 services QEMU[9823]: TR =0040 fffffe5344d55000 00004087 00008b00 DPL=0 TSS64-busy
Nov 23 15:14:05 services QEMU[9823]: GDT=     fffffe5344d53000 0000007f
Nov 23 15:14:05 services QEMU[9823]: IDT=     fffffe0000000000 00000fff
Nov 23 15:14:05 services QEMU[9823]: CR0=80050033 CR2=000055c615459000 CR3=0000000103d3e000 CR4=00350ef0
Nov 23 15:14:05 services QEMU[9823]: DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
Nov 23 15:14:05 services QEMU[9823]: DR6=00000000ffff0ff0 DR7=0000000000000400
Nov 23 15:14:05 services QEMU[9823]: EFER=0000000000001d01
Nov 23 15:14:05 services QEMU[9823]: Code=e2 02 75 09 f6 87 38 a2 04 00 10 75 77 4c 03 a3 00 09 00 00 <45> 8b 24 24 eb 12 4c 89 e6 48 8b 87 40 09 00 00 e8 27 cd a8 d8 41 89 c4 66 90 44 89 e0 5b
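
In case it helps, this is how I have been grepping the host kernel log around the crash for PCIe/AER or IOMMU faults that might explain the "Bad address" (the keyword list is just my guess at what is relevant):

```shell
# Scan the current boot's kernel log for PCIe link, AER, or AMD IOMMU
# messages that line up with the time of the guest crash.
journalctl -k -b --no-pager | grep -iE 'AER|amd-vi|iommu|pcieport|vfio'
```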

After force-stopping the VM, I see the following messages:
Code:
Nov 23 15:28:19 services qmeventd[230210]: Starting cleanup for 117
Nov 23 15:28:19 services qmeventd[230210]: Finished cleanup for 117
Nov 23 15:28:19 services kernel: pcieport 0000:02:00.0: broken device, retraining non-functional downstream link at 2.5GT/s
Nov 23 15:28:20 services kernel: pcieport 0000:02:00.0: retraining failed
Nov 23 15:28:20 services kernel: pcieport 0000:02:00.0: Data Link Layer Link Active not set in 1000 msec
Nov 23 15:28:20 services kernel: vfio-pci 0000:03:00.1: Unable to change power state from D0 to D3hot, device inaccessible
Nov 23 15:28:20 services kernel: vfio-pci 0000:03:00.0: Unable to change power state from D0 to D3hot, device inaccessible
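
The D3hot failures after the crash make it look like the link to the card went down. One thing I'm considering trying (an assumption on my part, not something I've verified helps on this card) is keeping vfio-pci from putting the idle device into D3 at all:

```shell
# Set the vfio-pci disable_idle_d3 module parameter so unused/idle
# passed-through devices are not moved into the D3 low-power state.
echo "options vfio-pci disable_idle_d3=1" > /etc/modprobe.d/vfio-idle-d3.conf

# Rebuild the initramfs so the option applies at early boot.
update-initramfs -u -k all
```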

Has anyone worked with a W7600 GPU before and hit similar issues?
Does this point to Proxmox, or is the problem within the Debian VM itself?