vfio_bar_restore: reset recovery - restoring BARs

karypid

Member
Mar 7, 2021
Hello everyone,

I am trying to pass through a 6700XT properly. Currently I am stuck in the following state:

  • Proxmox boots and I can start a VM using the GPU
  • The GPU runs stably in the VM (BasemarkGPU, Unigine Heaven, Steam games)
  • After shutting down the VM I can no longer restart it
  • The first attempt to RE-start the VM after it has been shut down gives a few messages like this:
    • vfio_bar_restore: reset recovery - restoring BARs
  • No GPU output can be seen when this happens - screen is blank
  • I can shut down the VM (qm shutdown <id>) even though there is no display output
  • The second attempt to RE-start the VM after it has been shut down gives a few messages like this:
    • vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
    • vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
  • At this point I need to restart the host to recover.
  • If I suspend the host to memory and then resume, I am able to restart the guest VM using that GPU
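The stages listed above can be told apart from the host kernel log. The following is a minimal sketch, not an official tool: the function name `vfio_reset_stage` and the stage labels are made up for illustration, and it matches only the message texts quoted in this thread (the "config space inaccessible" message appears in a later log excerpt).

```shell
# Hypothetical helper: read host kernel log text on stdin and print the most
# severe stage seen, based only on the messages quoted in this thread.
vfio_reset_stage() {
    awk '
        /vfio_bar_restore: reset recovery/ { if (stage < 1) stage = 1 }
        /not ready [0-9]+ms after FLR/     { if (stage < 2) stage = 2 }
        /config space inaccessible/        { if (stage < 3) stage = 3 }
        END {
            if      (stage == 3) print "config-space-inaccessible"
            else if (stage == 2) print "flr-timeout"
            else if (stage == 1) print "bar-restore"
            else                 print "clean"
        }
    '
}
```

It could be fed with something like `journalctl -k -b | vfio_reset_stage` on the host.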
System:
Boot/Kernel:
  • I have tried with 5.13 (default), 5.15 (opt-in) and 5.16 (edge) with the same results.
  • CSM is **DISABLED** in BIOS. I boot Proxmox using UEFI and systemd-boot.
  • Command-line used:
Code:
root@pve:~# cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nofb nomodeset video=efifb:off,vesafb:off,vesa:off

root@pve:~# pveversion
pve-manager/7.1-11/8d529482 (running kernel: 5.13.19-6-pve)
  • The modules configuration is:
Code:
root@pve:/etc/modprobe.d# cat vfio.conf
options vfio-pci ids=1002:73df,1002:ab28,1002:67df,1002:aaf0,1002:699f,1002:aae0 disable_idle_d3=1 disable_vga=1

root@pve:/etc/modprobe.d# cat pve-blacklist.conf
# This file contains a list of modules which are not supported by Proxmox VE

# nidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
blacklist nvidiafb
blacklist amdgpu
blacklist snd_hda_intel

root@pve:/etc/modprobe.d# cat blacklist.conf
blacklist radeon
blacklist nouveau
blacklist nvidia
blacklist amdgpu
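To cross-check the `ids=` list against what `lspci -nn` reports, the option line can be split into one ID per line. This is only a sketch: the function name `vfio_ids` is hypothetical, and it assumes the `options vfio-pci ids=...` format shown above.

```shell
# Hypothetical helper: print one vendor:device ID per line from a modprobe
# "options vfio-pci ... ids=..." line read on stdin.
vfio_ids() {
    sed -n 's/^options[[:space:]]\{1,\}vfio-pci[[:space:]].*ids=\([^[:space:]]*\).*/\1/p' \
        | tr ',' '\n'
}
```

For example, `vfio_ids < /etc/modprobe.d/vfio.conf` would list the six IDs configured here, which can then be compared with the bracketed IDs in the `lspci` output.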

When I clean-boot the system, I can see my IOMMUs look fine:

The VFIO driver has taken over the GPUs and their audio functions:

Code:
root@pve:~# lspci -nnk | grep vfio -B2 -A1
0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [1002:73df] (rev c1)
        Subsystem: Gigabyte Technology Co., Ltd Navi 22 [1458:232e]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [1002:699f] (rev c7)
        Subsystem: Sapphire Technology Limited Lexa PRO [Radeon RX 550] [1da2:e367]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
0c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1002:aae0]
        Subsystem: Sapphire Technology Limited Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X] [1da2:aae0]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

There are no messages on the screen, indicating that the kernel has not allocated the device:

Code:
root@pve:~# journalctl -b | egrep "vfio|vendor|BAR|vgaarb|modeset"
Mar 30 01:58:38 pve kernel: Command line: initrd=\EFI\proxmox\5.13.19-6-pve\initrd.img-5.13.19-6-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nofb nomodeset video=efifb:off,vesafb:off,vesa:off
Mar 30 01:58:38 pve kernel: Kernel command line: initrd=\EFI\proxmox\5.13.19-6-pve\initrd.img-5.13.19-6-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nofb nomodeset video=efifb:off,vesafb:off,vesa:off
Mar 30 01:58:38 pve kernel: You have booted with nomodeset. This means your GPU drivers are DISABLED
Mar 30 01:58:38 pve kernel: Unless you actually understand what nomodeset does, you should reboot without enabling it
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: BAR 0: assigned to efifb
Mar 30 01:58:38 pve kernel: pci 0000:0b:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
Mar 30 01:58:38 pve kernel: pci 0000:0b:00.0: vgaarb: bridge control possible
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: vgaarb: bridge control possible
Mar 30 01:58:38 pve kernel: pci 0000:0c:00.0: vgaarb: setting as boot device
Mar 30 01:58:38 pve kernel: vgaarb: loaded
Mar 30 01:58:38 pve kernel: vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:73df[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:ab28[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:67df[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:aaf0[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio-pci 0000:0c:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:699f[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve kernel: vfio_pci: add [1002:aae0[ffffffff:ffffffff]] class 0x000000/00000000
Mar 30 01:58:38 pve systemd-modules-load[1822]: Inserted module 'vfio'
Mar 30 01:58:38 pve systemd-modules-load[1822]: Inserted module 'vfio_pci'
Mar 30 01:58:38 pve systemd-modules-load[1822]: Inserted module 'vendor_reset'
Mar 30 01:58:39 pve kernel: vendor_reset_hook: installed

AT THIS POINT I CAN START A VM WHICH USES THE RX6700XT SUCCESSFULLY.
  • Q35 machine
  • Host CPU type
  • OVMF BIOS

This is what is printed in the log when I start the VM:

Code:
Mar 30 02:09:04 pve pvedaemon[14713]: start VM 102: UPID:pve:00003979:0000F5E6:6243ADB0:qmstart:102:root@pam:
Mar 30 02:09:04 pve pvedaemon[3401]: <root@pam> starting task UPID:pve:00003979:0000F5E6:6243ADB0:qmstart:102:root@pam:
Mar 30 02:09:04 pve systemd[1]: Created slice qemu.slice.
Mar 30 02:09:04 pve systemd[1]: Started 102.scope.
Mar 30 02:09:04 pve systemd-udevd[14727]: Using default interface naming scheme 'v247'.
Mar 30 02:09:04 pve systemd-udevd[14727]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mar 30 02:09:04 pve kernel: device tap102i0 entered promiscuous mode
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:09:04 pve kernel: vmbr0: port 2(tap102i0) entered forwarding state
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: enabling device (0002 -> 0003)
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Mar 30 02:09:06 pve kernel: vfio-pci 0000:0b:00.1: enabling device (0000 -> 0002)
Mar 30 02:09:07 pve pvedaemon[3401]: <root@pam> end task UPID:pve:00003979:0000F5E6:6243ADB0:qmstart:102:root@pam: OK

Now here is the problem: suppose I shut down the VM using the GPU. In the host logs I see:

Code:
Mar 30 02:10:30 pve pvedaemon[24579]: shutdown VM 102: UPID:pve:00006003:0001179A:6243AE06:qmshutdown:102:root@pam:
Mar 30 02:10:30 pve pvedaemon[3401]: <root@pam> starting task UPID:pve:00006003:0001179A:6243AE06:qmshutdown:102:root@pam:
Mar 30 02:10:32 pve QEMU[14722]: kvm: terminating on signal 15 from pid 2536 (/usr/sbin/qmeventd)
Mar 30 02:10:32 pve kernel:  zd160: p1 p2
Mar 30 02:10:32 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:10:32 pve pvedaemon[3401]: <root@pam> end task UPID:pve:00006003:0001179A:6243AE06:qmshutdown:102:root@pam: OK
Mar 30 02:10:32 pve qmeventd[25228]: Starting cleanup for 102
Mar 30 02:10:32 pve qmeventd[25228]: Finished cleanup for 102
Mar 30 02:10:33 pve systemd[1]: 102.scope: Succeeded.
Mar 30 02:10:33 pve systemd[1]: 102.scope: Consumed 37.892s CPU time.

Here is where the problems begin:

If I now try to start the VM a second time, I get no signal on my screen (black screen), and in the host logs I see this error:

Code:
Mar 30 02:12:36 pve pvedaemon[26528]: start VM 102: UPID:pve:000067A0:000148D0:6243AE84:qmstart:102:root@pam:
Mar 30 02:12:36 pve pvedaemon[3400]: <root@pam> starting task UPID:pve:000067A0:000148D0:6243AE84:qmstart:102:root@pam:
Mar 30 02:12:36 pve systemd[1]: Started 102.scope.
Mar 30 02:12:36 pve systemd-udevd[26542]: Using default interface naming scheme 'v247'.
Mar 30 02:12:36 pve systemd-udevd[26542]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mar 30 02:12:36 pve kernel: device tap102i0 entered promiscuous mode
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:12:36 pve kernel: vmbr0: port 2(tap102i0) entered forwarding state
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Mar 30 02:12:38 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Mar 30 02:12:39 pve kernel: vfio-pci 0000:0b:00.1: vfio_bar_restore: reset recovery - restoring BARs
Mar 30 02:12:39 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
Mar 30 02:12:39 pve pvedaemon[3400]: <root@pam> end task UPID:pve:000067A0:000148D0:6243AE84:qmstart:102:root@pam: OK
Mar 30 02:12:41 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
... MANY TIMES
Mar 30 02:12:52 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs

Note that even though there is no display output, the VM is operational. I can ssh into the VM over the network and run commands.

This is what I see in the host logs when I use the shutdown button (the QEMU agent picks it up and stops the VM):

Code:
Mar 30 02:15:41 pve pvedaemon[33036]: shutdown VM 102: UPID:pve:0000810C:000190DA:6243AF3D:qmshutdown:102:root@pam:
Mar 30 02:15:41 pve pvedaemon[3401]: <root@pam> starting task UPID:pve:0000810C:000190DA:6243AF3D:qmshutdown:102:root@pam:
Mar 30 02:15:41 pve QEMU[26537]: kvm: terminating on signal 15 from pid 2536 (/usr/sbin/qmeventd)
Mar 30 02:15:41 pve kernel:  zd160: p1 p2
Mar 30 02:15:41 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:15:42 pve pvedaemon[3401]: <root@pam> end task UPID:pve:0000810C:000190DA:6243AF3D:qmshutdown:102:root@pam: OK
Mar 30 02:15:42 pve qmeventd[33542]: Starting cleanup for 102
Mar 30 02:15:42 pve qmeventd[33542]: Finished cleanup for 102
Mar 30 02:15:43 pve systemd[1]: 102.scope: Succeeded.
Mar 30 02:15:43 pve systemd[1]: 102.scope: Consumed 2min 7.869s CPU time.

So my question is: how do I fix this? How can I get the VM to display output the second time around?

Note that if I try to restart the VM one more time, the situation gets much worse. The VM fails to start (I cannot even reach it over the network, as it does not seem to boot at all) and in the host logs I see:

Code:
Mar 30 02:17:23 pve pvedaemon[34628]: start VM 102: UPID:pve:00008744:0001B90E:6243AFA3:qmstart:102:root@pam:
Mar 30 02:17:23 pve pvedaemon[3402]: <root@pam> starting task UPID:pve:00008744:0001B90E:6243AFA3:qmstart:102:root@pam:
Mar 30 02:17:24 pve kernel: vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
Mar 30 02:17:25 pve kernel: vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
Mar 30 02:17:27 pve kernel: vfio-pci 0000:0b:00.0: not ready 2047ms after FLR; waiting
Mar 30 02:17:29 pve kernel: vfio-pci 0000:0b:00.0: not ready 4095ms after FLR; waiting
Mar 30 02:17:33 pve kernel: vfio-pci 0000:0b:00.0: not ready 8191ms after FLR; waiting
Mar 30 02:17:42 pve kernel: vfio-pci 0000:0b:00.0: not ready 16383ms after FLR; waiting
Mar 30 02:17:59 pve kernel: vfio-pci 0000:0b:00.0: not ready 32767ms after FLR; waiting
Mar 30 02:18:34 pve kernel: vfio-pci 0000:0b:00.0: not ready 65535ms after FLR; giving up
Mar 30 02:18:34 pve systemd[1]: Started 102.scope.
Mar 30 02:18:34 pve systemd-udevd[35658]: Using default interface naming scheme 'v247'.
Mar 30 02:18:34 pve systemd-udevd[35658]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Mar 30 02:18:34 pve kernel: device tap102i0 entered promiscuous mode
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered blocking state
Mar 30 02:18:34 pve kernel: vmbr0: port 2(tap102i0) entered forwarding state
Mar 30 02:18:36 pve kernel: vfio-pci 0000:0b:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 30 02:18:36 pve kernel: vfio-pci 0000:0b:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 30 02:18:36 pve kernel: vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
Mar 30 02:18:37 pve kernel: vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
Mar 30 02:18:39 pve kernel: vfio-pci 0000:0b:00.0: not ready 2047ms after FLR; waiting
Mar 30 02:18:41 pve kernel: vfio-pci 0000:0b:00.0: not ready 4095ms after FLR; waiting
Mar 30 02:18:41 pve pvedaemon[3400]: VM 102 qmp command failed - VM 102 qmp command 'query-proxmox-support' failed - unable to connect to VM 102 qmp socket - timeout after 31 retries
...
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: not ready 65535ms after FLR; giving up
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_cap_init: hiding cap 0xff@0xff
... MANY TIMES
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_cap_init: hiding cap 0xff@0xff
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0xffff@0x100
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
... MANY TIMES
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
Mar 30 02:19:45 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:19:45 pve kernel: vmbr0: port 2(tap102i0) entered disabled state
Mar 30 02:19:45 pve kernel: vfio-pci 0000:0b:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Mar 30 02:19:46 pve systemd[1]: 102.scope: Succeeded.
Mar 30 02:19:46 pve systemd[1]: 102.scope: Consumed 1.906s CPU time.
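The "not ready ...ms after FLR" values in the log above each equal a power of two minus one, which suggests a wait that doubles every round. As a sketch of the arithmetic only (not of the kernel code itself), the sequence can be reproduced as follows. The function name is hypothetical, and the 1000 ms reporting threshold and 60000 ms cut-off are assumptions chosen to match the log.

```shell
# Reproduce the "not ready <N>ms after FLR" values seen in the log.
# Assumptions: the wait doubles each round, messages start once the wait
# exceeds 1000 ms, and polling stops once it exceeds 60000 ms.
flr_wait_sequence() {
    delay=1
    while :; do
        if [ "$delay" -gt 60000 ]; then
            echo "$((delay - 1)) giving up"
            break
        fi
        if [ "$delay" -gt 1000 ]; then
            echo "$((delay - 1)) waiting"
        fi
        delay=$((delay * 2))
    done
}
```

Under these assumptions the waits add up to about 65 seconds before the reset is abandoned, which is roughly in line with the timestamps above (02:17:24 to 02:18:34).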
 
I should note that:

  • The 6700XT is passed through with the ROM
Code:
hostpci0: 0000:0b:00,pcie=1,x-vga=1,romfile=Gigabyte.RX6700XT.12288.210318.rom
  • If I try the same with the RX550 (which uses the vendor-reset) everything works
 
Also note that if I suspend/resume the host, the issue is resolved and I can then successfully restart the VM and use the 6700XT in the VM.

Here is the script I use:

Bash:
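#!/bin/bash
# Workaround for the failed GPU reset: remove the GPU from the PCI bus,
# suspend the host to RAM, then rescan the bus after resume.

N=4   # seconds until the RTC alarm wakes the host again

# Detach both functions of the GPU (VGA and HDMI audio) from the PCI bus.
echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.0/remove
echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.1/remove

# Arm the RTC alarm without suspending (-m no), then suspend via systemd.
# Alternative way to suspend: echo -n mem > /sys/power/state
rtcwake -v -m no -s "$N"
echo "Scheduled auto-resume in $N seconds"
echo "COMPUTER WILL GO TO SLEEP, PRESS POWER BUTTON TO RESUME MANUALLY"
systemctl suspend

# After resume: wait 5 seconds in total (1 + 4) before touching the bus.
sleep 1
echo ============================================================
echo "COMPUTER RESUMED, WAITING 5 SECONDS BEFORE PCI RESCAN"
sleep 4s
# Re-enumerate the PCI bus so the GPU is discovered again.
echo "1" | tee -a /sys/bus/pci/rescan
echo "Reset done"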
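The script hardcodes both function addresses of the GPU. As a hedged variation, the `remove` paths could be derived from the slot address instead. The helper name `pci_remove_paths` and its optional second argument (an alternative sysfs root, useful for trying it against a scratch directory) are assumptions for illustration, not part of the script above.

```shell
# Hypothetical helper: print the sysfs "remove" path of every function of a
# PCI slot (e.g. 0000:0b:00). The optional second argument overrides the
# sysfs root so the function can be tried against a scratch directory.
pci_remove_paths() {
    slot=$1
    root=${2:-/sys/bus/pci/devices}
    for dev in "$root/$slot".*; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -e "$dev/remove" ] && printf '%s\n' "$dev/remove"
    done
    return 0
}
```

A loop such as `for p in $(pci_remove_paths 0000:0b:00); do echo 1 > "$p"; done` would then detach every function of the card in one go.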

I see in the host logs when I resume:

Code:
Mar 30 02:29:15 pve kernel: vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
Mar 30 02:29:15 pve kernel: pci 0000:0b:00.0: Removing from iommu group 26
Mar 30 02:29:15 pve kernel: pci 0000:0b:00.1: Removing from iommu group 27
Mar 30 02:29:15 pve systemd[1]: Reached target Sleep.
Mar 30 02:29:15 pve systemd[1]: Starting Suspend...
Mar 30 02:29:15 pve systemd-sleep[43052]: Suspending system...
Mar 30 02:29:15 pve kernel: PM: suspend entry (deep)
Mar 30 02:29:21 pve kernel: Filesystems sync: 0.337 seconds
Mar 30 02:29:21 pve kernel: Freezing user space processes ... (elapsed 0.000 seconds) done.
Mar 30 02:29:21 pve kernel: OOM killer disabled.
Mar 30 02:29:21 pve kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Mar 30 02:29:21 pve kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Mar 30 02:29:21 pve kernel: wlp4s0: deauthenticating from 3c:84:6a:92:35:4f by local choice (Reason: 3=DEAUTH_LEAVING)
Mar 30 02:29:21 pve kernel: serial 00:04: disabled
Mar 30 02:29:21 pve kernel: sd 5:0:0:0: [sdb] Synchronizing SCSI cache
Mar 30 02:29:21 pve kernel: sd 4:0:0:0: [sda] Synchronizing SCSI cache
Mar 30 02:29:21 pve kernel: sd 5:0:0:0: [sdb] Stopping disk
Mar 30 02:29:21 pve kernel: sd 4:0:0:0: [sda] Stopping disk
Mar 30 02:29:21 pve kernel: ACPI: Preparing to enter system sleep state S3
Mar 30 02:29:21 pve kernel: PM: Saving platform NVS memory
Mar 30 02:29:21 pve kernel: Disabling non-boot CPUs ...
Mar 30 02:29:21 pve kernel: IRQ 114: no longer affine to CPU1
Mar 30 02:29:21 pve kernel: smpboot: CPU 1 is now offline
...
Mar 30 02:29:21 pve kernel: smpboot: CPU 31 is now offline
Mar 30 02:29:21 pve kernel: ACPI: Low-level resume complete
Mar 30 02:29:21 pve kernel: PM: Restoring platform NVS memory
Mar 30 02:29:21 pve kernel: LVT offset 0 assigned for vector 0x400
Mar 30 02:29:21 pve kernel: Enabling non-boot CPUs ...
Mar 30 02:29:21 pve kernel: x86: Booting SMP configuration:
Mar 30 02:29:21 pve kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
Mar 30 02:29:21 pve kernel: microcode: CPU1: patch_level=0x0a201016
Mar 30 02:29:21 pve kernel: ACPI: \_PR_.C002: Found 2 idle states
Mar 30 02:29:21 pve kernel: CPU1 is up
...
Mar 30 02:29:21 pve kernel: smpboot: Booting Node 0 Processor 31 APIC 0x1f
Mar 30 02:29:21 pve kernel: microcode: CPU31: patch_level=0x0a201016
Mar 30 02:29:21 pve kernel: ACPI: \_PR_.C01F: Found 2 idle states
Mar 30 02:29:21 pve kernel: CPU31 is up
Mar 30 02:29:21 pve kernel: ACPI: Waking up from system sleep state S3
Mar 30 02:29:21 pve kernel: sd 5:0:0:0: [sdb] Starting disk
Mar 30 02:29:21 pve kernel: sd 4:0:0:0: [sda] Starting disk
Mar 30 02:29:21 pve kernel: serial 00:04: activated
Mar 30 02:29:21 pve kernel: ata3: SATA link down (SStatus 0 SControl 300)
Mar 30 02:29:21 pve kernel: ata10: SATA link down (SStatus 0 SControl 300)
Mar 30 02:29:21 pve kernel: ata4: SATA link down (SStatus 0 SControl 300)
Mar 30 02:29:21 pve kernel: ata9: SATA link down (SStatus 0 SControl 300)
Mar 30 02:29:21 pve kernel: OOM killer enabled.
Mar 30 02:29:21 pve kernel: Restarting tasks ... done.
Mar 30 02:29:21 pve kernel: PM: suspend exit
Mar 30 02:29:21 pve wpa_supplicant[2769]: wlp4s0: CTRL-EVENT-DISCONNECTED bssid=3c:84:6a:92:35:4f reason=3 locally_generated=1
Mar 30 02:29:21 pve systemd-sleep[43052]: System resumed.
Mar 30 02:29:21 pve wpa_supplicant[2769]: wlp4s0: CTRL-EVENT-REGDOM-CHANGE init=DRIVER type=WORLD
Mar 30 02:29:21 pve kernel: usb 7-2.4: new full-speed USB device number 7 using xhci_hcd
Mar 30 02:29:21 pve kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 30 02:29:21 pve kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 30 02:29:21 pve kernel: ata5.00: supports DRM functions and may not be fully accessible
Mar 30 02:29:21 pve kernel: ata6.00: supports DRM functions and may not be fully accessible
Mar 30 02:29:21 pve kernel: ata5.00: supports DRM functions and may not be fully accessible
Mar 30 02:29:21 pve kernel: ata5.00: configured for UDMA/133
Mar 30 02:29:21 pve kernel: ata6.00: supports DRM functions and may not be fully accessible
Mar 30 02:29:21 pve kernel: ata6.00: configured for UDMA/133
Mar 30 02:29:21 pve systemd-sleep[43374]: /dev/sda:
Mar 30 02:29:21 pve systemd-sleep[43374]:  setting Advanced Power Management level to 0xfe (254)
Mar 30 02:29:21 pve systemd-sleep[43374]:  APM_level        = 254
Mar 30 02:29:21 pve systemd-sleep[43409]: /dev/sdb:
Mar 30 02:29:21 pve systemd-sleep[43409]:  setting Advanced Power Management level to 0xfe (254)
Mar 30 02:29:21 pve systemd-sleep[43409]:  APM_level        = 254
Mar 30 02:29:21 pve kernel: usb 7-2.4: device descriptor read/64, error -71
Mar 30 02:29:21 pve systemd[1]: systemd-suspend.service: Succeeded.
Mar 30 02:29:21 pve systemd[1]: Finished Suspend.
Mar 30 02:29:21 pve systemd[1]: Stopped target Sleep.
Mar 30 02:29:21 pve systemd[1]: Reached target Suspend.
Mar 30 02:29:21 pve systemd-logind[2547]: Operation 'sleep' finished.
Mar 30 02:29:21 pve systemd[1]: Stopped target Suspend.
Mar 30 02:29:21 pve kernel: usb 7-2.4: device descriptor read/64, error -71
Mar 30 02:29:22 pve kernel: usb 7-2.4: new full-speed USB device number 8 using xhci_hcd
Mar 30 02:29:22 pve kernel: usb 7-2.4: device descriptor read/64, error -71
Mar 30 02:29:22 pve kernel: usb 7-2.4: device descriptor read/64, error -71
Mar 30 02:29:22 pve kernel: usb 7-2-port4: attempt power cycle
Mar 30 02:29:23 pve kernel: usb 7-2.4: new full-speed USB device number 9 using xhci_hcd
Mar 30 02:29:23 pve kernel: usb 7-2.4: device descriptor read/8, error 0
Mar 30 02:29:23 pve kernel: usb 7-2.4: device descriptor read/8, error 0
Mar 30 02:29:23 pve kernel: usb 7-2.4: new full-speed USB device number 10 using xhci_hcd
Mar 30 02:29:23 pve kernel: usb 7-2.4: device descriptor read/8, error 0
Mar 30 02:29:23 pve kernel: usb 7-2.4: device descriptor read/8, error 0
Mar 30 02:29:23 pve kernel: usb 7-2-port4: unable to enumerate USB device
Mar 30 02:29:25 pve wpa_supplicant[2769]: wlp4s0: SME: Trying to authenticate with 3c:84:6a:92:35:4e (SSID='savvinio' freq=2437 MHz)
Mar 30 02:29:25 pve kernel: wlp4s0: authenticate with 3c:84:6a:92:35:4e
Mar 30 02:29:25 pve wpa_supplicant[2769]: wlp4s0: CTRL-EVENT-REGDOM-CHANGE init=DRIVER type=COUNTRY alpha2=GB
Mar 30 02:29:25 pve kernel: wlp4s0: send auth to 3c:84:6a:92:35:4e (try 1/3)
Mar 30 02:29:25 pve wpa_supplicant[2769]: wlp4s0: Trying to associate with 3c:84:6a:92:35:4e (SSID='savvinio' freq=2437 MHz)
Mar 30 02:29:25 pve kernel: wlp4s0: authenticated
Mar 30 02:29:25 pve kernel: wlp4s0: associate with 3c:84:6a:92:35:4e (try 1/3)
Mar 30 02:29:25 pve kernel: wlp4s0: RX AssocResp from 3c:84:6a:92:35:4e (capab=0x431 status=0 aid=5)
Mar 30 02:29:25 pve wpa_supplicant[2769]: wlp4s0: Associated with 3c:84:6a:92:35:4e
Mar 30 02:29:25 pve wpa_supplicant[2769]: wlp4s0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
Mar 30 02:29:25 pve kernel: wlp4s0: associated
Mar 30 02:29:25 pve wpa_supplicant[2769]: wlp4s0: WPA: Key negotiation completed with 3c:84:6a:92:35:4e [PTK=CCMP GTK=TKIP]
Mar 30 02:29:25 pve wpa_supplicant[2769]: wlp4s0: CTRL-EVENT-CONNECTED - Connection to 3c:84:6a:92:35:4e completed [id=0 id_str=]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: [1002:73df] type 00 class 0x030000
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: reg 0x10: [mem 0x00000000-0x0fffffff 64bit pref]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: reg 0x18: [mem 0x00000000-0x001fffff 64bit pref]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: reg 0x20: [io  0x0000-0x00ff]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: reg 0x24: [mem 0x00000000-0x000fffff]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: Max Payload Size set to 256 (was 128, max 256)
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: PME# supported from D1 D2 D3hot D3cold
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:00:03.1 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: Adding to iommu group 26
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.1: [1002:ab28] type 00 class 0x040300
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.1: Max Payload Size set to 256 (was 128, max 256)
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.1: PME# supported from D1 D2 D3hot D3cold
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.1: Adding to iommu group 27
Mar 30 02:29:25 pve kernel: pcieport 0000:0a:00.0: ASPM: current common clock configuration is inconsistent, reconfiguring
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: BAR 0: assigned [mem 0xd0000000-0xdfffffff 64bit pref]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: BAR 2: assigned [mem 0xe0000000-0xe01fffff 64bit pref]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: BAR 5: assigned [mem 0xfcc00000-0xfccfffff]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: BAR 6: assigned [mem 0xfcd00000-0xfcd1ffff pref]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.1: BAR 0: assigned [mem 0xfcd20000-0xfcd23fff]
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.0: BAR 4: assigned [io  0xf000-0xf0ff]
Mar 30 02:29:25 pve kernel: vfio-pci 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Mar 30 02:29:25 pve kernel: pci 0000:0b:00.1: D0 power state depends on 0000:0b:00.0
 
After exhausting all options, I looked at different 6000 series cards and ended up replacing the Aorus 6700 XT: there is simply something wrong with that particular card.

I am now using:
  • MSI Radeon RX 6700 XT Mech 2X 12GB
  • Asrock Radeon RX 6800 Challenger Pro 16GB
Both of them work without issues with the same setup (kernel configuration and BIOS settings). As far as I'm concerned, whatever issue exists lies with the Gigabyte Aorus Elite 6700.

References:
[1] https://forum.level1techs.com/t/6700xt-reset-bug/181814
[2] https://www.reddit.com/r/VFIO/comments/tq9j5v/need_help_compiling_a_list_of_amd_6000_series/
 
Can you please share the IOMMU groups of the X570S Aero G: for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done? I'm considering it for my last AM4 build.
 
Sure:

Code:
root@pve:~# for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done
IOMMU group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 10 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 11 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
IOMMU group 12 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
IOMMU group 12 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU group 13 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0 [1022:1440]
IOMMU group 13 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1 [1022:1441]
IOMMU group 13 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2 [1022:1442]
IOMMU group 13 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3 [1022:1443]
IOMMU group 13 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4 [1022:1444]
IOMMU group 13 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5 [1022:1445]
IOMMU group 13 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6 [1022:1446]
IOMMU group 13 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7 [1022:1447]
IOMMU group 14 01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream [1022:57ad]
IOMMU group 15 02:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a3]
IOMMU group 16 02:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a3]
IOMMU group 17 02:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a3]
IOMMU group 18 02:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]
IOMMU group 18 06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU group 18 06:00.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
IOMMU group 18 06:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
IOMMU group 19 02:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]
IOMMU group 19 07:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU group 1 00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU group 20 02:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]
IOMMU group 20 08:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU group 21 03:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:3241]
IOMMU group 22 04:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6 AX200 [8086:2723] (rev 1a)
IOMMU group 23 05:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 01)
IOMMU group 24 09:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c3)
IOMMU group 25 0a:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
IOMMU group 26 0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf] (rev c3)
IOMMU group 27 0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
IOMMU group 28 0c:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c5)
IOMMU group 29 0d:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
IOMMU group 2 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 30 0e:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [1002:73df] (rev c5)
IOMMU group 31 0e:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
IOMMU group 32 0f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
IOMMU group 33 10:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
IOMMU group 34 10:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
IOMMU group 35 10:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
IOMMU group 36 10:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]
IOMMU group 3 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 4 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU group 5 00:03.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
IOMMU group 6 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 7 00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 8 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
IOMMU group 9 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
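Because the shell glob expands in lexical order, the groups above are listed as 0, 10, 11, ... rather than numerically. A small sketch for post-processing this kind of output follows; both function names are hypothetical, and the code assumes the exact "IOMMU group N address ..." line format printed by the loop above.

```shell
# Hypothetical helper: sort "IOMMU group N ..." lines numerically by group
# number, then by PCI address.
sort_iommu_groups() {
    sort -k3,3n -k4,4
}

# Hypothetical helper: keep only groups that contain more than one device,
# since those devices can only be passed through together.
shared_iommu_groups() {
    awk '
        { count[$3]++; lines[$3] = lines[$3] $0 "\n" }
        END { for (g in count) if (count[g] > 1) printf "%s", lines[g] }
    ' | sort_iommu_groups
}
```

Piping the loop's output through `shared_iommu_groups` would leave only the multi-device groups, which in the listing above are groups 12, 13, 18, 19 and 20.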
 
I happened to get an X570S AERO G revision 1.1 with a MediaTek WiFi [14c3:0608] instead of an Intel one. In case anyone runs into the issue that the Linux kernel PCI driver cannot map the BARs of the mt7921e device: disable Above 4G Decoding in the BIOS and use a very recent kernel version. PCIe passthrough works fine.
 
