PCIe passthrough devices fail with errors

hodor

New Member
Jun 13, 2023
Computer Specs:
CPU: Intel N100
RAM: 8GB
Storage: 250GB SSD

VM Config:
Code:
boot: order=scsi0;ide2;net0
cores: 4
hostpci0: 0000:01:00.0
hostpci1: 0000:02:00.0
hostpci2: 0000:03:00.0
ide2: local:iso/debian-11.7.0-amd64-netinst.iso,media=cdrom,size=389M
memory: 2048
meta: creation-qemu=7.2.0,ctime=1686634272
name: Debian
net0: virtio=F2:58:B9:29:6C:3C,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-100-disk-0,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=a7dfe925-96c8-414c-888f-ebdb956e210c
sockets: 1
vmgenid: c7756bff-3cce-476d-b925-7a4a89efe57f

My Grub config:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
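(For reference: changes to the kernel command line only take effect after regenerating the GRUB config and rebooting.)
Code:
update-grub
reboot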

When the VM starts, the connection is lost, and the console outputs the following error:
Code:
igb 0000:04:00.0 enp4s0: PCIe link lost, device now detached
vfio-pci 0000:03:00.0: Unable to change power state from d3cold to d0 device inaccessible
vfio-pci 0000:03:00.0: Unable to change power state from d3cold to d0 device inaccessible

0000:01:00.0 through 0000:04:00.0 are the network card devices.
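For what it's worth, the D3/D0 power state that the vfio-pci messages complain about can be inspected from the host, e.g.:
Code:
# The 'Status:' line under Power Management shows the current D-state
lspci -vv -s 0000:03:00.0 | grep -A3 'Power Management'
# Recent kernels also expose the D-state directly via sysfs
cat /sys/bus/pci/devices/0000:03:00.0/power_state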
 
Check your IOMMU groups. Maybe all four network devices are in the same group and therefore cannot be shared between VMs and the Proxmox host. That would explain why the host loses 04:00.0/enp4s0. Try passing all four through to the same VM and don't use one for Proxmox. But check the groups first to see whether there are more devices in the same group (apart from PCI bridges).
EDIT: This gives a nice overview of the groups:
Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done
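Expanded with comments, the same loop reads:
Code:
# Walk every device node registered under an IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}   # strip the path prefix up to the group number
    n=${n%%/*}                # drop everything after the group number
    printf 'IOMMU group %s ' "$n"
    lspci -nns "${d##*/}"     # describe the device at that PCI address
done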
 
Each of my network devices is in a different group. On my machine, their IOMMU groups are 11, 12, 13, and 14, respectively. With kernel version 5.15.102-1, I was able to pass three PCIe devices through without any issues. However, after upgrading the kernel, I can still add the first and second devices successfully, but when I attempt to add the third device, the system fails to boot.

Code:
IOMMU group 11 01:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
IOMMU group 12 02:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
IOMMU group 13 03:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
IOMMU group 14 04:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
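For reference, lspci -nnk shows which driver currently claims each port; the 'Kernel driver in use' line should read vfio-pci while a VM holds the device:
Code:
lspci -nnk -s 01:00.0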
 
To which kernel did you upgrade? 6.2? Could you maybe boot once with the older kernel and post both the IOMMU groups and the output of 'dmesg' from both the old and new kernel?
 
From 5.15.102-1-pve to 6.2.9-1-pve. The attached file contains the logs you asked for.

Sorry, perhaps I misunderstood you. Are you referring to booting the PVE host once, or starting the problematic virtual machine once?

EDIT:
'running vm.txt' contains the logs from while the VM was running
 

Attachments

  • log.txt
141 KB
  • running vm.txt
80.9 KB
Thanks. Do you also have dmesg from the old kernel where the VM was running?

In any case, it might be an issue with the driver being unloaded (the 6.2 kernel contains a newer driver that might behave differently). You could check whether there is a firmware upgrade for the card, and you can try to blacklist the devices/driver so that the host driver does not get bound to them in the first place; see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_pci_passthrough
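A minimal sketch of the blacklist approach, assuming the ports use the igb driver and the [8086:125c] ID from the output above (file locations as per the linked guide):
Code:
# /etc/modprobe.d/blacklist.conf -- keep the host driver away from the NICs
blacklist igb

# /etc/modprobe.d/vfio.conf -- or claim them for vfio-pci by vendor:device ID
options vfio-pci ids=8086:125c
Note that all four ports share the same [8086:125c] ID, so either variant would also claim 04:00.0/enp4s0, which the host is currently using. Apply with 'update-initramfs -u -k all' and reboot.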
 
This is the dmesg from kernel version 5.15.102-1-pve, where the virtual machine was running.

Thank you for your suggestion. I will try some additional troubleshooting steps to resolve this issue and provide you with more information.
 

Attachments

  • Linux version 5.15.102-1-pve-running vm.txt
153.7 KB
OK, comparing the two runs, it really seems like a driver issue: the only difference that seems to matter is the kernel warning and the 'PCIe link lost' message.
I'd really suggest blacklisting the driver if possible, or binding the devices to vfio-pci early, to prevent any weird driver interaction.
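For the early-binding variant, the admin guide's softdep mechanism would look roughly like this (again assuming the igb driver and the [8086:125c] ID from this thread):
Code:
# /etc/modprobe.d/vfio.conf -- make sure vfio-pci loads before the NIC driver
softdep igb pre: vfio-pci
options vfio-pci ids=8086:125c
followed by 'update-initramfs -u -k all' and a reboot so the change is included in the initramfs.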
 
I also have this situation; downgrading the kernel makes passthrough work again.
 
Thank you for your suggestion. I will give it a try.
Also, I'd like to ask: will this issue be addressed or fixed in future versions?
 
