NIC PCI passthrough failure

Cyrus
Dec 27, 2022

Device: Beelink EQ59 N5105 (two NICs)
Proxmox version: 7.3-4

Code:
root@pve ~# lspci
00:00.0 Host bridge: Intel Corporation Device 4e24
00:02.0 VGA compatible controller: Intel Corporation Device 4e61 (rev 01)
00:04.0 Signal processing controller: Intel Corporation Device 4e03
00:14.0 USB controller: Intel Corporation Device 4ded (rev 01)
00:14.2 RAM memory: Intel Corporation Device 4def (rev 01)
00:15.0 Serial bus controller [0c80]: Intel Corporation Device 4de8 (rev 01)
00:15.1 Serial bus controller [0c80]: Intel Corporation Device 4de9 (rev 01)
00:15.2 Serial bus controller [0c80]: Intel Corporation Device 4dea (rev 01)
00:15.3 Serial bus controller [0c80]: Intel Corporation Device 4deb (rev 01)
00:16.0 Communication controller: Intel Corporation Device 4de0 (rev 01)
00:17.0 SATA controller: Intel Corporation Device 4dd3 (rev 01)
00:19.0 Serial bus controller [0c80]: Intel Corporation Device 4dc5 (rev 01)
00:19.1 Serial bus controller [0c80]: Intel Corporation Device 4dc6 (rev 01)
00:1c.0 PCI bridge: Intel Corporation Device 4dbc (rev 01)
00:1c.5 PCI bridge: Intel Corporation Device 4dbd (rev 01)
00:1c.6 PCI bridge: Intel Corporation Device 4dbe (rev 01)
00:1e.0 Communication controller: Intel Corporation Device 4da8 (rev 01)
00:1e.3 Serial bus controller [0c80]: Intel Corporation Device 4dab (rev 01)
00:1f.0 ISA bridge: Intel Corporation Device 4d87 (rev 01)
00:1f.3 Audio device: Intel Corporation Device 4dc8 (rev 01)
00:1f.4 SMBus: Intel Corporation Device 4da3 (rev 01)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Device 4da4 (rev 01)
01:00.0 Network controller: Intel Corporation Wireless 3165 (rev 81)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev ff)

I installed the latest 6.1 kernel on Proxmox and passed a NIC through as a PCIe device to an OpenWrt virtual machine. The OpenWrt VM worked for about 6 days, then froze suddenly yesterday. The Proxmox web interface showed "internal-error" for the VM. I opened its console but could not type anything. I stopped the VM and started it again, and it failed to start.

Yesterday, I upgraded Proxmox, removed all 6.1 and 5.19 kernels, and rebooted. Proxmox was then running the 5.15 kernel, and the VM ran correctly. Today, the VM froze again and Proxmox showed "internal-error". I stopped and started the VM, and it failed to start again.

Error message:

Code:
kvm: ../hw/pci/pci.c:1562: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1

VM config (excerpt):

Code:
bios: ovmf
boot: order=scsi0
cores: 3
cpu: host
efidisk0: local-lvm:vm-101-disk-2,efitype=4m,size=4M
hostpci0: 0000:03:00,pcie=1
machine: q35
memory: 2048
meta: creation-qemu=6.2.0,ctime=1668389097
name: OpenWrt
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-101-disk-0,size=6164M
scsihw: virtio-scsi-pci
smbios1: uuid=0f9be60c-d319-4afb-a0b5-f9c3621b9b3f
sockets: 1
startup: order=1
unused0: local-lvm:vm-101-disk-1

GRUB config (excerpt):
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"

/etc/modules:
Code:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
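Whether the modules listed in /etc/modules are actually loaded after a reboot can be checked against /proc/modules. The following is a minimal Python sketch, not part of the original post: the helper name `missing_modules` and the sample text are illustrative assumptions. Modules built into the kernel do not appear in /proc/modules, and on newer kernels `vfio_virqfd` is reportedly no longer a separate module, so a "missing" result is only a hint, not a diagnosis.

```python
# Module names taken from the /etc/modules listing in the post.
REQUIRED = ["vfio", "vfio_iommu_type1", "vfio_pci", "vfio_virqfd"]


def missing_modules(proc_modules_text, required=REQUIRED):
    """Return the required module names that are absent from /proc/modules text."""
    # The first whitespace-separated field of each /proc/modules line is the module name.
    loaded = {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}
    return [name for name in required if name not in loaded]


# Made-up sample in /proc/modules format (NOT captured from the machine in the post).
SAMPLE = """\
vfio_pci 16384 0 - Live 0x0000000000000000
vfio_iommu_type1 40960 0 - Live 0x0000000000000000
vfio 45056 2 vfio_pci,vfio_iommu_type1, Live 0x0000000000000000
"""

print("not loaded:", missing_modules(SAMPLE))
```

On the host itself, the same function could be fed the contents of /proc/modules, for example `missing_modules(open("/proc/modules").read())`.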

Journal (duplicate lines removed):

Code:
-- Journal begins at Sun 2022-11-13 23:39:21 CST, ends at Tue 2022-12-27 15:32:25 CST. --
Dec 27 15:32:11 pve kernel: vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0xffff@0xffc
# the line above is repeated 927 times
Dec 27 15:32:11 pve kernel: vfio-pci 0000:03:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: DPC: Data Link Layer Link Active not set in 1000 msec
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: AER: subordinate device reset failed
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: DPC: containment event, status:0x1f01 source:0x0000
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: DPC: unmasked uncorrectable error detected
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6:   device [8086:4dbe] error status/mask=00100000/00000000
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6:    [20] UnsupReq               (First)
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: AER:   TLP Header: 34000000 03000010 00000000 90039003
Dec 27 15:32:12 pve pvestatd[1222]: storage 'hi' is not online
Dec 27 15:32:12 pve kernel: vfio-pci 0000:03:00.0: invalid power transition (from D3cold to D3hot)
Dec 27 15:32:12 pve pvedaemon[259909]: VM 101 qmp command failed - VM 101 not running
Dec 27 15:32:12 pve systemd[1]: 101.scope: Succeeded.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit 101.scope has successfully entered the 'dead' state.
Dec 27 15:32:12 pve pvedaemon[269302]: start failed: QEMU exited with code 1
Dec 27 15:32:12 pve pvedaemon[265545]: <root@pam> end task UPID:pve:00041BF6:0068A1CD:63AA9F78:qmstart:101:root@pam: start failed: QEMU exited with code 1
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: DPC: Data Link Layer Link Active not set in 1000 msec
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: AER: subordinate device reset failed
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: DPC: containment event, status:0x1f01 source:0x0000
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: DPC: unmasked uncorrectable error detected
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6:   device [8086:4dbe] error status/mask=00100000/00000000
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6:    [20] UnsupReq               (First)
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: AER:   TLP Header: 34000000 03000010 00000000 90039003
Dec 27 15:32:15 pve pvestatd[1222]: storage 'we' is not online
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6: DPC: Data Link Layer Link Active not set in 1000 msec
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6: AER: subordinate device reset failed
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6: DPC: containment event, status:0x1f01 source:0x0000
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6: DPC: unmasked uncorrectable error detected
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6:   device [8086:4dbe] error status/mask=00100000/00000000
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6:    [20] UnsupReq               (First)
Dec 27 15:32:16 pve kernel: pcieport 0000:00:1c.6: AER:   TLP Header: 34000000 03000010 00000000 90039003
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6: DPC: Data Link Layer Link Active not set in 1000 msec
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6: AER: subordinate device reset failed
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6: DPC: containment event, status:0x1f01 source:0x0000
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6: DPC: unmasked uncorrectable error detected
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6:   device [8086:4dbe] error status/mask=00100000/00000000
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6:    [20] UnsupReq               (First)
Dec 27 15:32:19 pve kernel: pcieport 0000:00:1c.6: AER:   TLP Header: 34000000 03000010 00000000 90039003
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6: DPC: Data Link Layer Link Active not set in 1000 msec
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6: AER: subordinate device reset failed
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6: DPC: containment event, status:0x1f01 source:0x0000
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6: DPC: unmasked uncorrectable error detected
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6:   device [8086:4dbe] error status/mask=00100000/00000000
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6:    [20] UnsupReq               (First)
Dec 27 15:32:21 pve kernel: pcieport 0000:00:1c.6: AER:   TLP Header: 34000000 03000010 00000000 90039003
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6: DPC: Data Link Layer Link Active not set in 1000 msec
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6: AER: subordinate device reset failed
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6: DPC: containment event, status:0x1f01 source:0x0000
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6: DPC: unmasked uncorrectable error detected
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6:   device [8086:4dbe] error status/mask=00100000/00000000
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6:    [20] UnsupReq               (First)
Dec 27 15:32:23 pve kernel: pcieport 0000:00:1c.6: AER:   TLP Header: 34000000 03000010 00000000 90039003
Dec 27 15:32:24 pve pvestatd[1222]: storage 'hi' is not online
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6: DPC: Data Link Layer Link Active not set in 1000 msec
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6: AER: subordinate device reset failed
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6: DPC: containment event, status:0x1f01 source:0x0000
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6: DPC: unmasked uncorrectable error detected
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6:   device [8086:4dbe] error status/mask=00100000/00000000
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6:    [20] UnsupReq               (First)
Dec 27 15:32:25 pve kernel: pcieport 0000:00:1c.6: AER:   TLP Header: 34000000 03000010 00000000 90039003
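Because the same DPC/AER block repeats every two to three seconds, it can help to condense the journal by counting distinct messages instead of trimming duplicates by hand. Below is a minimal Python sketch, not part of the original post; the function name `summarize` is an illustrative assumption, and the regular expression assumes the default short journal format seen above (month, day, time, hostname). The sample lines are copied from the log in this post.

```python
import re
from collections import Counter

# Matches the "Dec 27 15:32:11 pve " prefix of short-format journal lines.
PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+")


def summarize(journal_text):
    """Count each distinct message after stripping timestamp and hostname."""
    counts = Counter()
    for line in journal_text.splitlines():
        m = PREFIX.match(line)
        if m:  # lines without a timestamp prefix (e.g. the "░░" notes) are skipped
            counts[line[m.end():]] += 1
    return counts


SAMPLE = """\
Dec 27 15:32:11 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
Dec 27 15:32:12 pve kernel: vfio-pci 0000:03:00.0: invalid power transition (from D3cold to D3hot)
Dec 27 15:32:14 pve kernel: pcieport 0000:00:1c.6: AER: device recovery failed
"""

for message, count in summarize(SAMPLE).most_common():
    print(f"{count:4d}  {message}")
```

The input could come from saved `journalctl` output; the summary keeps every distinct message while making the repetition count explicit.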
 
I isolated the NIC and rebooted Proxmox. The problem still exists.

Code:
root@pve ~# cat /etc/modprobe.d/vfio.conf
# isolate 03:00.0 NIC
options vfio-pci ids=10ec:8168

root@pve ~# update-initramfs -u -k all; update-grub
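One thing that may be worth checking here: the `ids=` option of vfio-pci matches by vendor:device ID, not by PCI address, and the lspci output at the top shows two Realtek controllers with the same model string (02:00.0 and 03:00.0). If both report 10ec:8168, this option would claim both NICs rather than only 03:00.0. The Python sketch below is not from the original post; the helper name `devices_with_id` is hypothetical, and the sample `lspci -nn` lines are an assumption about what this machine would print, since the post does not show the numeric IDs of the individual devices.

```python
import re

# `lspci -nn` ends each description with "[vendor:device]", optionally
# followed by "(rev xx)".
ID_AT_END = re.compile(r"\[([0-9a-f]{4}:[0-9a-f]{4})\](?:\s+\(rev [0-9a-f]+\))?\s*$")


def devices_with_id(lspci_nn_text, wanted_id):
    """Return the PCI addresses whose [vendor:device] ID equals wanted_id."""
    matches = []
    for line in lspci_nn_text.splitlines():
        m = ID_AT_END.search(line)
        if m and m.group(1) == wanted_id.lower():
            matches.append(line.split()[0])  # first field is the bus address
    return matches


# ASSUMED sample output of `lspci -nn` (IDs are illustrative, not from the post).
SAMPLE = """\
00:1c.6 PCI bridge [0604]: Intel Corporation Device [8086:4dbe] (rev 01)
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev ff)
"""

print(devices_with_id(SAMPLE, "10ec:8168"))
```

If more than one address comes back for the ID given in vfio.conf, the comment "isolate 03:00.0 NIC" does not describe what the option actually does, and the host would lose the other NIC to vfio-pci as well.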
 
