EDIT: this is the output of lspci -v on the Proxmox host:
Does this mean that the kernel is still using the card? Should I also blacklist amdgpu as well?
Code:
83:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
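For reading output like the above: as far as I know, "Kernel driver in use" names the driver currently bound to the device, while "Kernel modules" only lists modules that could claim it. A minimal sketch that pulls both fields out of lspci-style text on stdin; the function name pci_driver_summary is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -k` / `lspci -v` style text on stdin and
# print the bound driver and the candidate modules for each device.
pci_driver_summary() {
  awk '
    # Device header line, e.g. "83:00.0 VGA compatible controller: ..."
    /^([0-9a-fA-F]+:)?[0-9a-fA-F][0-9a-fA-F]:[0-9a-fA-F][0-9a-fA-F]\.[0-7] / { dev = $1 }
    /Kernel driver in use:/ {
      sub(/^.*Kernel driver in use:[ \t]*/, "")
      print dev " bound=" $0
    }
    /Kernel modules:/ {
      sub(/^.*Kernel modules:[ \t]*/, "")
      print dev " candidates=" $0
    }
  '
}
# Example on the host (83:00.0 is the address from the output above):
#   lspci -k -s 83:00.0 | pci_driver_summary
```

If the "bound=" line shows vfio-pci, the host driver is not holding the card, whatever the "candidates=" line lists.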
There is a distinct and repeatable series of events that I find very confusing, and diagnosing the cause is far beyond my expertise.
The short version is that I am having trouble with PCIe passthrough of an AMD Radeon RX 6600 to an Arch Linux VM guest. Basically, every time I shut down the VM with PCI passthrough, I have to do a full reboot of Proxmox before I can boot that VM again. It's very frustrating.
The hardware is as follows:
- Gigabyte MZ72-HB0
- AMD Epyc 7402
- AMD Radeon RX 6600
The long version is as follows:
Prior to adding interrupt remapping, I had to perform a full reboot of Proxmox and also remove and then re-add the PCI device (steps listed below) to get the VM to boot. Simply rebooting the VM did not work. The syslogs are included below. Then I enabled interrupt remapping with the following commands:
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
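A side note on those two options, hedged: as far as I understand, allow_unsafe_interrupts=1 does not itself turn interrupt remapping on; it lets VFIO assign devices on platforms where remapping is not available. ignore_msrs=1 makes KVM ignore unhandled guest MSR accesses. A minimal sketch for checking that both options are actually live after a reboot, by reading the module parameters from sysfs; the function name module_param and its optional third argument are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the live value of a module parameter from sysfs.
# $1 = module, $2 = parameter, $3 = sysfs module root (defaults to /sys/module).
module_param() {
  file="${3:-/sys/module}/$1/parameters/$2"
  if [ -r "$file" ]; then
    printf '%s.%s=%s\n' "$1" "$2" "$(cat "$file")"
  else
    # Module not loaded, or it has no such parameter.
    printf '%s.%s=unavailable\n' "$1" "$2"
  fi
}
# On the host, for example:
#   module_param vfio_iommu_type1 allow_unsafe_interrupts
#   module_param kvm ignore_msrs
```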
Steps to get the VM booting before enabling interrupt remapping. To get the VM to boot with the PCI device after it has been shut down, I have to do the following steps exactly:
- reboot Proxmox
- remove the PCI device from the VM
- boot the VM without the PCI device
- shut down the VM
- reboot Proxmox again
- boot the VM without the PCI device
- shut down the VM
- add the PCI device to the VM
- boot the VM with the PCI device
Since enabling interrupt remapping, the VM boots and shuts down just fine the first time, but after that I need to reboot Proxmox. Also, as I am shutting the VM down with the PCI device installed, what appears to be some kind of kernel panic is output to the noVNC viewer. See below.
This is the syslog when booting the VM the second time, prior to rebooting Proxmox:
Code:
Dec 16 22:01:46 central pvedaemon[10228]: start VM 310: UPID:central:000027F4:00006B22:61BBB74A:qmstart:310:root@pam:
Dec 16 22:01:46 central pvedaemon[4441]: <root@pam> starting task UPID:central:000027F4:00006B22:61BBB74A:qmstart:310:root@pam:
Dec 16 22:01:46 central systemd[1]: Created slice qemu.slice.
Dec 16 22:01:46 central systemd[1]: Started 310.scope.
Dec 16 22:01:46 central systemd-udevd[10009]: Using default interface naming scheme 'v247'.
Dec 16 22:01:46 central systemd-udevd[10009]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 16 22:01:47 central kernel: device tap310i0 entered promiscuous mode
Dec 16 22:01:47 central systemd-udevd[10243]: Using default interface naming scheme 'v247'.
Dec 16 22:01:47 central systemd-udevd[10243]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 16 22:01:47 central systemd-udevd[10243]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 16 22:01:47 central systemd-udevd[10009]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Dec 16 22:01:47 central kernel: fwbr310i0: port 1(fwln310i0) entered blocking state
Dec 16 22:01:47 central kernel: fwbr310i0: port 1(fwln310i0) entered disabled state
Dec 16 22:01:47 central kernel: device fwln310i0 entered promiscuous mode
Dec 16 22:01:47 central kernel: fwbr310i0: port 1(fwln310i0) entered blocking state
Dec 16 22:01:47 central kernel: fwbr310i0: port 1(fwln310i0) entered forwarding state
Dec 16 22:01:47 central kernel: vmbr20: port 2(fwpr310p0) entered blocking state
Dec 16 22:01:47 central kernel: vmbr20: port 2(fwpr310p0) entered disabled state
Dec 16 22:01:47 central kernel: device fwpr310p0 entered promiscuous mode
Dec 16 22:01:47 central kernel: device eno2np1.20 entered promiscuous mode
Dec 16 22:01:47 central kernel: device eno2np1 entered promiscuous mode
Dec 16 22:01:47 central kernel: vmbr20: port 2(fwpr310p0) entered blocking state
Dec 16 22:01:47 central kernel: vmbr20: port 2(fwpr310p0) entered forwarding state
Dec 16 22:01:47 central kernel: fwbr310i0: port 2(tap310i0) entered blocking state
Dec 16 22:01:47 central kernel: fwbr310i0: port 2(tap310i0) entered disabled state
Dec 16 22:01:47 central kernel: fwbr310i0: port 2(tap310i0) entered blocking state
Dec 16 22:01:47 central kernel: fwbr310i0: port 2(tap310i0) entered forwarding state
Dec 16 22:01:49 central kernel: vfio-pci 0000:83:00.0: enabling device (0002 -> 0003)
Dec 16 22:01:49 central kernel: vfio-pci 0000:83:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Dec 16 22:01:49 central kernel: vfio-pci 0000:83:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Dec 16 22:01:49 central kernel: vfio-pci 0000:83:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Dec 16 22:01:49 central kernel: vfio-pci 0000:83:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Dec 16 22:01:49 central pvedaemon[4441]: <root@pam> end task UPID:central:000027F4:00006B22:61BBB74A:qmstart:310:root@pam: OK
Dec 16 22:01:50 central pvedaemon[10289]: starting vnc proxy UPID:central:00002831:00006C81:61BBB74E:vncproxy:310:root@pam:
Dec 16 22:01:50 central pvedaemon[4442]: <root@pam> starting task UPID:central:00002831:00006C81:61BBB74E:vncproxy:310:root@pam:
Dec 16 22:01:57 central kernel: kvm [10237]: ignored rdmsr: 0xc0011020 data 0x0
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:01:58 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:02 central kernel: kvm_msr_ignored_check: 6137 callbacks suppressed
Dec 16 22:02:02 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:02 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:02 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:02 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:03 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:03 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:03 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:03 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:03 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:03 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:07 central kernel: kvm_msr_ignored_check: 378 callbacks suppressed
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:07 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:08 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:08 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:13 central kernel: kvm_msr_ignored_check: 36 callbacks suppressed
Dec 16 22:02:13 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:13 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:13 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:13 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:14 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:14 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:14 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:14 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:15 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:15 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:19 central kernel: kvm_msr_ignored_check: 4 callbacks suppressed
Dec 16 22:02:19 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:19 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:22 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:22 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:25 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:25 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:27 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:27 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:28 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:28 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:31 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:31 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:02:35 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:03:48 central kernel: kvm_msr_ignored_check: 49 callbacks suppressed
Dec 16 22:03:48 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x400
Dec 16 22:03:48 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
Dec 16 22:03:48 central kernel: kvm [10237]: ignored wrmsr: 0xc0011020 data 0x0
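Most of the log above is the same MSR access repeating, which makes the vfio-pci lines easy to miss. A minimal sketch, assuming syslog-formatted text on stdin, that collapses the "ignored rdmsr/wrmsr" lines into counts and passes everything else through; the function name squash_msr_noise is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: collapse repeated "ignored rdmsr/wrmsr" lines from
# syslog text on stdin into counts, and print every other line unchanged.
squash_msr_noise() {
  awk '
    /kvm \[[0-9]+\]: ignored (rd|wr)msr:/ {
      # Last four fields look like: "wrmsr: 0xc0011020 data 0x400"
      key = $(NF-3) " " $(NF-2) " data " $NF
      count[key]++
      next
    }
    /kvm_msr_ignored_check:/ { next }   # drop the rate-limit notices
    { print }
    END { for (k in count) print count[k] "x ignored " k }   # order not guaranteed
  '
}
# Example on the host:
#   journalctl -k | squash_msr_noise
```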
This is what I assume to be some kind of kernel panic on shutting down the VM:
Installation and configuration
Here is my /etc/default/grub file:
Code:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off"
GRUB_CMDLINE_LINUX=""
# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"
# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console
# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480
# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true
# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"
# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
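Editing /etc/default/grub only takes effect after update-grub and a reboot, so it can be worth confirming what the running kernel was actually booted with. A minimal sketch that checks for a parameter as a whole word in /proc/cmdline; the function name has_kernel_param and its optional second argument are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if parameter $1 appears as a whole word in the
# kernel command line $2 (defaults to the running kernel's /proc/cmdline).
has_kernel_param() {
  cmdline="${2:-$(cat /proc/cmdline)}"
  case " $cmdline " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}
# Example on the host:
#   for p in amd_iommu=on iommu=pt nomodeset; do
#     if has_kernel_param "$p"; then echo "$p: active"; else echo "$p: missing"; fi
#   done
```

Matching on whole words matters here because iommu=pt would otherwise also be found inside a longer parameter such as amd_iommu=pt.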
I followed a series of different posts all mashed together to form some kind of cogent guide.
The exact steps are as follows:
ON PROXMOX:
- enable IOMMU in BIOS (enabled by default)
- SVM mode and SR-IOV are also enabled by default in the BIOS (there are no visible options for disabling CSM)
- set the following kernel parameters in GRUB on Proxmox: amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction nofb nomodeset video=vesafb:off,efifb:off
- add the following to /etc/modules: vfio, vfio_iommu_type1, vfio_pci, vfio_virqfd
- test that remapping is enabled: dmesg | grep 'remapping'
- confirm dedicated groups: find /sys/kernel/iommu_groups/ -type l
- update-grub
- blacklist all drivers: echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf, echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf, echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
- add the vendor:device IDs to stop the host from claiming the device: echo "options vfio-pci ids=<id>:<id>,<id>:<id> disable_vga=1" > /etc/modprobe.d/vfio.conf
- enable interrupt remapping: echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf, echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
- update-initramfs -u
- reboot
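For the "confirm dedicated groups" step, the plain find prints raw symlink paths, which are hard to scan. A minimal sketch that prints the group number next to each PCI address, so it is easier to see whether the GPU at 0000:83:00.0 shares a group with anything else; the function name list_iommu_groups and its optional directory argument are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "group <n>: <pci address>" for each device link
# under an IOMMU groups directory ($1, default /sys/kernel/iommu_groups).
list_iommu_groups() {
  root="${1:-/sys/kernel/iommu_groups}"
  # sort is lexical, so group 10 is listed before group 2
  find "$root" -type l | sort | while IFS= read -r link; do
    rel="${link#"$root"/}"          # e.g. 12/devices/0000:83:00.0
    printf 'group %s: %s\n' "${rel%%/*}" "${link##*/}"
  done
}
# Example on the host:
#   list_iommu_groups | grep 83:00
```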
ON THE GUEST:
- installed the following drivers: mesa, lib32-mesa, xf86-video-amdgpu, amdvlk, lib32-amdvlk, mesa-vdpau, lib32-mesa-vdpau
This is my guest VM config:
Arch Linux 5.15.8-arch1-1
Here is the output from lspci -v under the VGA controller:
Code:
00:10.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c7) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5025
Physical Slot: 16
Flags: bus master, fast devsel, latency 0, IRQ 42
Memory at 800000000 (64-bit, prefetchable) [size=256M]
Memory at 810000000 (64-bit, prefetchable) [size=2M]
I/O ports at 1000 [size=256]
Memory at c1400000 (32-bit, non-prefetchable) [size=1M]
Expansion ROM at c1560000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu