Hello everyone,
I am trying to pass through a 6700XT properly. Currently I am stuck in the following state:
- Proxmox boots and I can start a VM using the GPU
- I can use the GPU in the VM quite stably (running BasemarkGPU, Unigine Heaven, Steam games)
- After shutting down the VM I can no longer restart it
- The first attempt to RE-start the VM after it has been shut down gives a few messages like this:
  - vfio_bar_restore: reset recovery - restoring BARs
  - No GPU output can be seen when this happens - screen is blank
  - I can shut down the vm (qm shutdown <id>) even though there is no display output
- The second attempt to RE-start the VM after it has been shut down gives a few messages like this:
  - vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
  - vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
  - At this point I need to restart the host to recover.
- If I suspend the host to memory and then resume, I am able to restart the guest VM using that GPU
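The states listed above can be told apart from the host kernel log by the messages quoted in the list. A minimal sketch, assuming the log text is piped in on stdin; the function name and the state labels are made up for illustration and are not part of Proxmox or vfio:

```shell
#!/bin/sh
# Classify the passthrough GPU's state from host kernel log text on stdin.
# The state labels are illustrative; the patterns are the messages quoted above.
classify_gpu_state() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -Eq 'not ready [0-9]+ms after FLR'; then
        echo "flr-timeout"      # second restart attempt: host reboot needed
    elif printf '%s\n' "$log" | grep -q 'vfio_bar_restore: reset recovery'; then
        echo "bar-restore"      # first restart attempt: VM runs, screen is blank
    else
        echo "clean"            # neither message seen
    fi
}
```

It could be fed with something like `journalctl -k -b | classify_gpu_state` on the host.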
- Gigabyte X570S Aero G (F4a BIOS, AGESA V2 1.2.0.6 B)
- Gigabyte Aorus 6700 XT Elite 12G
- XFX Radeon RX550
- Ryzen 5950X
- 64GB ECC RAM (Micron 18ASF2G72AZ-2G3B1)
- I have tried kernels 5.13 (default), 5.15 (opt-in) and 5.16 (edge) with the same results.
- CSM is **DISABLED** in BIOS. I boot Proxmox using UEFI and systemd-boot.
- Command-line used:
Code:
root@pve:~# cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nofb nomodeset video=efifb:off,vesafb:off,vesa:off
root@pve:~# pveversion
pve-manager/7.1-11/8d529482 (running kernel: 5.13.19-6-pve)
- The modules configuration is:
Code:
root@pve:/etc/modprobe.d# cat vfio.conf
options vfio-pci ids=1002:73df,1002:ab28,1002:67df,1002:aaf0,1002:699f,1002:aae0 disable_idle_d3=1 disable_vga=1
root@pve:/etc/modprobe.d# cat pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE
# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist amdgpu
blacklist snd_hda_intel
root@pve:/etc/modprobe.d# cat blacklist.conf
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist amdgpu
When I clean-boot the system, I can see my IOMMUs look fine:
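A minimal sketch of how such a listing can be produced from sysfs, assuming the standard /sys/kernel/iommu_groups layout; the function name and the optional root argument are only for illustration:

```shell
#!/bin/sh
# Print one line per PCI device, prefixed with its IOMMU group number.
# $1 (optional, illustrative): directory to scan; defaults to the sysfs location.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        group="${dev#"$root"/}"            # strip the root prefix
        group="${group%%/*}"               # keep only the group number
        printf 'IOMMU group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

Calling `list_iommu_groups` with no argument scans the running host. Groups are printed in glob order, so group 10 sorts before group 2.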
The VFIO driver has taken over the GPUs and their audio functions:
Code:
root@pve:~# lspci -nnk | grep vfio -B2 -A1
0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [1002:73df] (rev c1)
Subsystem: Gigabyte Technology Co., Ltd Navi 22 [1458:232e]
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [1002:699f] (rev c7)
Subsystem: Sapphire Technology Limited Lexa PRO [Radeon RX 550] [1da2:e367]
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
0c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
Subsystem: Sapphire Technology Limited Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1da2:aae0]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
There are no messages on the screen, indicating that the kernel has not allocated the device:
Code:
root@pve:~# journalctl -b | egrep "vfio|vendor|BAR|vgaarb|modeset"
Mar 30 01:58:38 pve kernel: Command line: initrd=\EFI\proxmox\5.13.19-6-pve\initrd.img-5.13.19-6-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nofb nomodeset video=efifb:off,vesafb:off,vesa:off
Mar 30 01:58:38 pve kernel: Kernel command line: initrd=\EFI\proxmox\5.13.19-6-pve\initrd.img-5.13.19-6-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nofb nomodeset video=efifb:off,vesafb:off,vesa:off
Mar 30 01:58:38 pve kernel: You have booted with nomodeset. This means your GPU drivers are DISABLED
Mar 30 01:58:38 pve kernel: Unless you actually understand what nomodeset does, you should reboot without enabling it
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: BAR 0: assigned to efifb
Mar 30 01:58:38 pve kernel: pci 0000:0b:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
Mar 30 01:58:38 pve kernel: pci 0000:0b:00.0: vgaarb: bridge control possible
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: vgaarb: bridge control possible
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: vgaarb: setting as boot device
Mar 30 01:58:38 pve kernel: vgaarb: loaded
Mar 30 01:58:38 pve kernel: vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:73df[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:ab28[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:67df[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:aaf0[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio-pci 0000:0c:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:699f[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve systemd-modules-load[1822]: Inserted module 'vfio'
Mar 30 01:58:38 pve systemd-modules-load[1822]: Inserted module 'vfio_pci'
Mar 30 01:58:38 pve systemd-modules-load[1822]: Inserted module 'vendor_reset'
Mar 30 01:58:39 pve kernel: vendor_reset_hook: installed
AT THIS POINT I CAN START A VM WHICH USES THE RX6700XT SUCCESSFULLY.
- Q35 machine
- Host CPU type
- OVMF BIOS
This is what is printed in the log when I start the VM:
Code:
Mar 30 02:09:04 pve pvedaemon[14713]: start VM 102: UPID:pve:00003979:0000F5E6:6243ADB0:qmstart:102:root@pam:
Mar 30 02:09:04 pve pvedaemon[3401]: <root@pam> starting task UPID:pve:00003979:0000F5E6:6243ADB0:qmstart:102:root@pam:
Mar 30 02:09:04 pve systemd[1]: Created slice qemu.slice.
Mar 30 02:09:04 pve systemd[1]: Started 102.scope.
Mar 30 02:09:04 pve systemd-udevd[14727]: Using default interface naming scheme 'v247'.
Mar 30 02:09:04 pve systemd-udevd[14727]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mar 30 02:09:04 pve kernel: device tap102i0 entered promiscuous mode
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered forwarding state
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: enabling device (0002 -> 0003)
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.1: enabling device (0000 -> 0002)
Mar 30 02:09:07 pve pvedaemon[3401]: <root@pam> end task UPID:pve:00003979:0000F5E6:6243ADB0:qmstart:102:root@pam: OK
Now here is the problem: suppose I shut down the VM using the GPU. In the host logs I see:
Code:
Mar 30 02:10:30 pve pvedaemon[24579]: shutdown VM 102: UPID:pve:00006003:0001179A:6243AE06:qmshutdown:102:root@pam:
Mar 30 02:10:30 pve pvedaemon[3401]: <root@pam> starting task UPID:pve:00006003:0001179A:6243AE06:qmshutdown:102:root@pam:
Mar 30 02:10:32 pve QEMU[14722]: kvm: terminating on signal 15 from pid 2536 (/usr/sbin/qmeventd)
Mar 30 02:10:32 pve kernel: zd160: p1 p2
Mar 30 02:10:32 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:10:32 pve pvedaemon[3401]: <root@pam> end task UPID:pve:00006003:0001179A:6243AE06:qmshutdown:102:root@pam: OK
Mar 30 02:10:32 pve qmeventd[25228]: Starting cleanup for 102
Mar 30 02:10:32 pve qmeventd[25228]: Finished cleanup for 102
Mar 30 02:10:33 pve systemd[1]: 102.scope: Succeeded.
Mar 30 02:10:33 pve systemd[1]: 102.scope: Consumed 37.892s CPU time.
Here is where the problems begin:
If I try to restart the VM now for a second time, I get no signal on my screen (black screen) and in the host logs I see this error:
Code:
Mar 30 02:12:36 pve pvedaemon[26528]: start VM 102: UPID:pve:000067A0:000148D0:6243AE84:qmstart:102:root@pam:
Mar 30 02:12:36 pve pvedaemon[3400]: <root@pam> starting task UPID:pve:000067A0:000148D0:6243AE84:qmstart:102:root@pam:
Mar 30 02:12:36 pve systemd[1]: Started 102.scope.
Mar 30 02:12:36 pve systemd-udevd[26542]: Using default interface naming scheme 'v247'.
Mar 30 02:12:36 pve systemd-udevd[26542]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mar 30 02:12:36 pve kernel: device tap102i0 entered promiscuous mode
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered forwarding state
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Mar 30 02:12:39 pve kernel: vfio-pci 0000:0b:00.1: vfio_bar_restore: reset recovery - restoring BARs
Mar 30 02:12:39 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
Mar 30 02:12:39 pve pvedaemon[3400]: <root@pam> end task UPID:pve:000067A0:000148D0:6243AE84:qmstart:102:root@pam: OK
Mar 30 02:12:41 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
... MANY TIMES
Mar 30 02:12:52 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
Note that even though there is no display output, the VM is operational. I can ssh into the VM over the network and run commands.
This is what I see in the host logs when I use the shutdown button (the QEMU Agent picks it up and stops the VM):
Code:
Mar 30 02:15:41 pve pvedaemon[33036]: shutdown VM 102: UPID:pve:0000810C:000190DA:6243AF3D:qmshutdown:102:root@pam:
Mar 30 02:15:41 pve pvedaemon[3401]: <root@pam> starting task UPID:pve:0000810C:000190DA:6243AF3D:qmshutdown:102:root@pam:
Mar 30 02:15:41 pve QEMU[26537]: kvm: terminating on signal 15 from pid 2536 (/usr/sbin/qmeventd)
Mar 30 02:15:41 pve kernel: zd160: p1 p2
Mar 30 02:15:41 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:15:42 pve pvedaemon[3401]: <root@pam> end task UPID:pve:0000810C:000190DA:6243AF3D:qmshutdown:102:root@pam: OK
Mar 30 02:15:42 pve qmeventd[33542]: Starting cleanup for 102
Mar 30 02:15:42 pve qmeventd[33542]: Finished cleanup for 102
Mar 30 02:15:43 pve systemd[1]: 102.scope: Succeeded.
Mar 30 02:15:43 pve systemd[1]: 102.scope: Consumed 2min 7.869s CPU time.
So my question is: how do I fix this? How can I get the VM to display output the second time around?
Note that if I try to restart the VM one more time, the situation gets much worse. The VM fails to start (I cannot even reach it over the network, as it does not seem to boot at all) and in the host logs I see:
Code:
Mar 30 02:17:23 pve pvedaemon[34628]: start VM 102: UPID:pve:00008744:0001B90E:6243AFA3:qmstart:102:root@pam:
Mar 30 02:17:23 pve pvedaemon[3402]: <root@pam> starting task UPID:pve:00008744:0001B90E:6243AFA3:qmstart:102:root@pam:
Mar 30 02:17:24 pve kernel: vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
Mar 30 02:17:25 pve kernel: vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
Mar 30 02:17:27 pve kernel: vfio-pci 0000:0b:00.0: not ready 2047ms after FLR; waiting
Mar 30 02:17:29 pve kernel: vfio-pci 0000:0b:00.0: not ready 4095ms after FLR; waiting
Mar 30 02:17:33 pve kernel: vfio-pci 0000:0b:00.0: not ready 8191ms after FLR; waiting
Mar 30 02:17:42 pve kernel: vfio-pci 0000:0b:00.0: not ready 16383ms after FLR; waiting
Mar 30 02:17:59 pve kernel: vfio-pci 0000:0b:00.0: not ready 32767ms after FLR; waiting
Mar 30 02:18:34 pve kernel: vfio-pci 0000:0b:00.0: not ready 65535ms after FLR; giving up
Mar 30 02:18:34 pve systemd[1]: Started 102.scope.
Mar 30 02:18:34 pve systemd-udevd[35658]: Using default interface naming scheme 'v247'.
Mar 30 02:18:34 pve systemd-udevd[35658]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mar 30 02:18:34 pve kernel: device tap102i0 entered promiscuous mode
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered forwarding state
Mar 30 02:18:36 pve kernel: vfio-pci 0000:0b:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 30 02:18:36 pve kernel: vfio-pci 0000:0b:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 30 02:18:36 pve kernel: vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
Mar 30 02:18:37 pve kernel: vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
Mar 30 02:18:39 pve kernel: vfio-pci 0000:0b:00.0: not ready 2047ms after FLR; waiting
Mar 30 02:18:41 pve kernel: vfio-pci 0000:0b:00.0: not ready 4095ms after FLR; waiting
Mar 30 02:18:41 pve pvedaemon[3400]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 31 retries
...
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: not ready 65535ms after FLR; giving up
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_cap_init: hiding cap 0xff@0xff
... MANY TIMES
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_cap_init: hiding cap 0xff@0xff
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0xffff@0x100
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
... MANY TIMES
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
Mar 30 02:19:45 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:19:45 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 30 02:19:46 pve systemd[1]: 102.scope: Succeeded.
Mar 30 02:19:46 pve systemd[1]: 102.scope: Consumed 1.906s CPU time.
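The "hiding cap 0xff@0xff" and "hiding ecap 0xffff" lines appear consistent with config-space reads returning all ones, which would match the "config space inaccessible" message. A minimal sketch to check for that state from the host, assuming the standard sysfs config file for the device; the function name is made up for illustration:

```shell
#!/bin/sh
# Succeed if a PCI config-space file still returns real data.
# $1: path to a config file, e.g. /sys/bus/pci/devices/0000:0b:00.0/config
config_space_alive() {
    # Read the first 4 bytes (vendor and device ID) as hex digits.
    id=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    # An empty result or all ff means the reads are not reaching the device.
    [ -n "$id" ] && [ "$id" != "ffffffff" ]
}
```

For example, `config_space_alive /sys/bus/pci/devices/0000:0b:00.0/config || echo "GPU not responding"` could be run before starting the VM.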