PCIe passthrough devices fail with errors

hodor

New Member
Jun 13, 2023
Computer Specs:
CPU: Intel N100
RAM: 8GB
Storage: 250GB SSD

VM Config:
Code:
boot: order=scsi0;ide2;net0
cores: 4
hostpci0: 0000:01:00.0
hostpci1: 0000:02:00.0
hostpci2: 0000:03:00.0
ide2: local:iso/debian-11.7.0-amd64-netinst.iso,media=cdrom,size=389M
memory: 2048
meta: creation-qemu=7.2.0,ctime=1686634272
name: Debian
net0: virtio=F2:58:B9:29:6C:3C,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-100-disk-0,iothread=1,size=32G
scsihw: virtio-scsi-single
smbios1: uuid=a7dfe925-96c8-414c-888f-ebdb956e210c
sockets: 1
vmgenid: c7756bff-3cce-476d-b925-7a4a89efe57f

My Grub config:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on"
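(For reference: changes to the kernel command line only take effect after regenerating the GRUB config and rebooting.)
Code:
update-grub
reboot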

When the VM starts, the connection is lost, and the console outputs the following error:
Code:
igb 0000:04:00.0 enp4s0: PCIe link lost, device now detached
vfio-pci 0000:03:00.0: Unable to change power state from d3cold to d0 device inaccessible
vfio-pci 0000:03:00.0: Unable to change power state from d3cold to d0 device inaccessible

0000:01:00.0 through 0000:04:00.0 are the network card devices.
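For what it's worth, the D3/D0 power state that the vfio-pci messages complain about can be inspected from the host, e.g.:
Code:
# The 'Status:' line under Power Management shows the current D-state
lspci -vv -s 0000:03:00.0 | grep -A3 'Power Management'
# Recent kernels also expose the D-state directly via sysfs
cat /sys/bus/pci/devices/0000:03:00.0/power_state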
 
Check your IOMMU groups. Maybe all four network devices are in the same group and therefore cannot be shared between VMs and the Proxmox host. That would explain why the host loses 04:00.0/enp4s0. Try passing all four through to the same VM and don't use one for Proxmox. But check the groups first to see whether there are more devices in the same group (apart from PCI bridges).
EDIT: This gives a nice overview of the groups:
Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done
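Expanded with comments, the same loop reads:
Code:
# Walk every device node registered under an IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}   # strip the path prefix up to the group number
    n=${n%%/*}                # drop everything after the group number
    printf 'IOMMU group %s ' "$n"
    lspci -nns "${d##*/}"     # describe the device at that PCI address
done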
 
Each of my network devices is in a different group. On my machine, their IOMMU groups are 11, 12, 13, and 14, respectively. With kernel version 5.15.102-1, I was able to pass three PCIe devices through without any issues. However, after upgrading the kernel, I can still add the first and second devices successfully, but when I attempt to add the third device, the system fails to boot.

Code:
IOMMU group 11 01:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
IOMMU group 12 02:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
IOMMU group 13 03:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
IOMMU group 14 04:00.0 Ethernet controller [0200]: Intel Corporation Device [8086:125c] (rev 04)
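For reference, lspci -nnk shows which driver currently claims each port; the 'Kernel driver in use' line should read vfio-pci while a VM holds the device:
Code:
lspci -nnk -s 01:00.0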
 
To which kernel did you upgrade? 6.2? Could you maybe boot once with the older kernel and post both the IOMMU groups and the output of 'dmesg' from both the old and new kernel?
 
From 5.15.102-1-pve to 6.2.9-1-pve. The attached file contains the logs you asked for.

Sorry, perhaps I misunderstood you. Are you referring to booting the PVE host once, or starting the problematic virtual machine once?

EDIT:
'running vm.txt' contains the logs from while the VM was running
 

Attachments

  • log.txt
141 KB
  • running vm.txt
80.9 KB
Thanks. Do you also have dmesg from the old kernel where the VM was running?

In any case, it might be an issue with the driver being unloaded (the 6.2 kernel contains a newer driver that might behave differently). You could check whether there is a firmware upgrade for the card, and you can try to blacklist the devices/driver so that the host driver does not get bound to them in the first place; see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#qm_pci_passthrough
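A minimal sketch of the blacklist approach, assuming the ports use the igb driver and the [8086:125c] ID from the output above (file locations as per the linked guide):
Code:
# /etc/modprobe.d/blacklist.conf -- keep the host driver away from the NICs
blacklist igb

# /etc/modprobe.d/vfio.conf -- or claim them for vfio-pci by vendor:device ID
options vfio-pci ids=8086:125c
Note that all four ports share the same [8086:125c] ID, so either variant would also claim 04:00.0/enp4s0, which the host is currently using. Apply with 'update-initramfs -u -k all' and reboot.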
 
This is the dmesg from kernel version 5.15.102-1-pve, where the virtual machine was running.

Thank you for your suggestion. I will try some additional troubleshooting steps to resolve this issue and provide you with more information.
 

Attachments

  • Linux version 5.15.102-1-pve-running vm.txt
153.7 KB
OK, comparing the two runs, it really seems like a driver issue: the only difference that seems to matter is the kernel warning and the 'PCIe link lost' message.
I'd really suggest blacklisting the driver if possible, or binding the devices to vfio-pci early, to prevent any weird driver interaction.
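For the early-binding variant, the admin guide's softdep mechanism would look roughly like this (again assuming the igb driver and the [8086:125c] ID from this thread):
Code:
# /etc/modprobe.d/vfio.conf -- make sure vfio-pci loads before the NIC driver
softdep igb pre: vfio-pci
options vfio-pci ids=8086:125c
followed by 'update-initramfs -u -k all' and a reboot so the change is included in the initramfs.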
 
I also have this situation; downgrading the kernel makes passthrough work again.
 
Thank you for your suggestion. I will give it a try.
Also, I'd like to ask: will this issue be addressed or fixed in future versions?
 
