NIC not working: PCIe link lost, device now detached

richi44

New Member
Sep 25, 2024
Hi.

I am running Proxmox VE on two identical HP Z2 G9 machines, each with an added HP 2.5GbE LAN Flex Port. The first machine runs without any problems, but the second one has a problem with the added PCIe NIC (the HP 2.5GbE LAN Flex Port). The NIC is in Linux bridge vmbr1, which bridges physical port enp1s0 and has VLAN aware enabled. This setup works great on the first machine. I do not know whether the problem is a hardware or a software issue.

network setup

Code:
cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eno1 inet manual

iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 10.20.30.17/24
    gateway 10.20.30.1
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
#1GB management

auto vmbr1
iface vmbr1 inet manual
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
#2.5GB trunk

logs showing time when pcie link was lost

Code:
journalctl -o short-precise -all | grep "PCIe link lost"
Sep 14 10:01:06.900018 pve07 kernel: igc 0000:02:00.0 enp2s0: PCIe link lost, device now detached
Sep 16 07:16:35.148158 pve07 kernel: igc 0000:01:00.0 enp1s0: PCIe link lost, device now detached
Sep 23 02:03:54.181872 pve07 kernel: igc 0000:01:00.0 enp1s0: PCIe link lost, device now detached
Sep 30 02:30:06.133725 pve07 kernel: igc 0000:01:00.0 enp1s0: PCIe link lost, device now detached
Oct 05 14:04:58.167852 pve07 kernel: igc 0000:01:00.0 enp1s0: PCIe link lost, device now detached
Oct 09 05:34:55.515763 pve07 kernel: igc 0000:01:00.0 enp1s0: PCIe link lost, device now detached

ip a before and after restart

Code:
2: enp1s0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether 30:13:8b:84:ca:f1 brd ff:ff:ff:ff:ff:ff

2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether 30:13:8b:84:ca:f1 brd ff:ff:ff:ff:ff:ff



There is also a problem with restarting: it has to be done physically, because after a reboot issued over the management NIC, the management NIC is no longer accessible over SSH, although ping still works. The reboot command probably does not actually reboot the machine at all (I need to connect a display to the server to check).
 
There have been other issues reported with this driver (just search the web for "PCIe link lost, device now detached").

The majority are on Asus machines like STRIX X670E-F, and in many cases the "pcie_port_pm=off pcie_aspm.policy=performance" kernel command-line parameters are a workaround.
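For reference, a minimal sketch of how one might check whether those two parameters are active on the running kernel. The missing_params helper name is hypothetical, and the persistence steps in the comments assume a stock Proxmox VE boot setup (GRUB or systemd-boot via proxmox-boot-tool); check them against your own installation before changing anything.

```shell
#!/bin/sh
# Parameters quoted in this thread as a workaround (not a fix).
WANTED="pcie_port_pm=off pcie_aspm.policy=performance"

# Print, one per line, each wanted parameter that is absent from the
# kernel command line passed as $1. Prints nothing if all are present.
missing_params() {
    for p in $WANTED; do
        case " $1 " in
            *" $p "*) ;;                  # already on the command line
            *) printf '%s\n' "$p" ;;
        esac
    done
}

# Read-only check against the running kernel; nothing is modified.
if [ -r /proc/cmdline ]; then
    missing=$(missing_params "$(cat /proc/cmdline)")
    if [ -n "$missing" ]; then
        echo "not active on this boot:"
        echo "$missing"
    else
        echo "both parameters are active"
    fi
fi

# To make the parameters persistent (assumption: stock Proxmox VE):
#   GRUB:         add them to GRUB_CMDLINE_LINUX_DEFAULT in
#                 /etc/default/grub, then run: update-grub
#   systemd-boot: append them to the single line in /etc/kernel/cmdline,
#                 then run: proxmox-boot-tool refresh
# Reboot afterwards and run this check again.
```

The script only reads /proc/cmdline, so it is safe to run before and after the change to confirm that the parameters actually reached the kernel.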

I suspect an issue with the Linux PCI ASPM code and would be interested in whether those parameters help this machine as well. Even if they help, that's not the final solution and we still need to fix the underlying problem. If you could open a report at https://bugzilla.kernel.org, attach the complete dmesg log and "sudo lspci -vv" output, and assign it to me (bjorn@helgaas.com), I'd like to take a look at it. I don't monitor this forum, so I may miss updates here.
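A rough sketch of how the requested attachments could be gathered on the affected host. The function names and the output directory are hypothetical, the sed pattern assumes log lines shaped like the ones posted above, and the previous-boot log is only available if the journal is persistent. Run it as root so that dmesg and lspci -vv return complete output.

```shell
#!/bin/sh
# Read journal lines on stdin and print the unique PCI addresses of igc
# devices that reported "PCIe link lost".
detached_devices() {
    sed -n 's/.*kernel: igc \([0-9a-f:.]*\) .*PCIe link lost.*/\1/p' | sort -u
}

# Collect the files asked for in the bug report into a directory
# (default: ./igc-report, a hypothetical name).
collect_report() {
    outdir="${1:-./igc-report}"
    mkdir -p "$outdir" || return 1

    # Complete kernel log of the current boot.
    dmesg > "$outdir/dmesg.log"

    # Verbose PCI listing, including link and ASPM state.
    lspci -vv > "$outdir/lspci-vv.txt"

    # Kernel log of the previous boot, which is where the link loss
    # shows up if the machine was restarted afterwards.
    journalctl -k -b -1 -o short-precise > "$outdir/previous-boot.log" 2>/dev/null ||
        echo "no previous-boot journal available" >&2

    # Which devices were affected, across all stored boots.
    journalctl -o short-precise -a | detached_devices > "$outdir/detached-devices.txt"
}
```

After sourcing the script, calling collect_report with no argument writes the files to ./igc-report, ready to attach to the report.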