Proxmox Host loses connection on management nic, reboot fixes it

zehetal

New Member
Apr 10, 2024
We are running a simple Proxmox 8.4.x cluster with different hardware on the hosts. Since the last update we have had an issue with the management NIC on our primary Linux bridge, vmbr0.
This bridge is used for management and as the primary NIC for our VMs. The host runs anywhere from several hours to several days without a problem, then the NIC goes offline (not physically, as our network administrator argues). After rebooting the system, the NIC works again, usually for several days.
In the host's system log there is this kernel message:

Apr 02 10:02:38 prox3 kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
TDH <9c>
TDT <d3>
next_to_use <d3>
next_to_clean <9b>
buffer_info[next_to_clean]:
time_stamp <113a03870>
next_to_watch <9c>
jiffies <113a30500>
next_to_watch.status <0>
MAC Status <80083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>

According to the vendor, Intel, the NIC seems to be physically OK.
Does anyone know of such an issue with Intel NICs on Proxmox 8.4.x?
All other network connections on the host (we have several of them) are working, but if we move vmbr0 to another NIC, a reboot no longer fixes the issue and the connection loss persists.
 
That kernel message is an e1000e "Hardware Unit Hang" - the transmit descriptor ring gets stuck (TDH/TDT mismatch means the hardware head pointer is behind the tail, so it stopped consuming packets). The driver detects this and tries to recover but sometimes cannot without a full reset, hence the reboot fixes it.
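To put a number on that mismatch, here is a small sketch that computes how many descriptors were queued by the driver (TDT) but not yet fetched by the hardware (TDH) in the log above. The function name is just for illustration, and the ring size of 256 is an assumption; ethtool -g enp0s31f6 shows the real value.

```shell
#!/bin/sh
# Descriptors handed to the NIC (tail) but not yet fetched by it (head).
# Arguments: TDH and TDT as hex without the 0x prefix, ring size in entries.
pending_descriptors() {
    tdh=$(( 0x$1 ))
    tdt=$(( 0x$2 ))
    ring=$3
    # Add the ring size before the modulo so a wrapped tail still works.
    echo $(( (tdt - tdh + ring) % ring ))
}

# TDH <9c> and TDT <d3> from the log; 256 is an assumed ring size.
pending_descriptors 9c d3 256
```

With the values from the log this prints 55, i.e. 55 descriptors were waiting while the head pointer did not move.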

This is a known issue with the Intel I219-V/LM series and has resurfaced with newer kernel versions. Two things worth trying:

First, disable Energy Efficient Ethernet (EEE) on that NIC - the I219 is notorious for going into a low-power state and not coming back cleanly:

ethtool --set-eee enp0s31f6 eee off
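To check whether the setting actually took effect, the EEE state can be read back with ethtool --show-eee. A minimal sketch, assuming the output contains an "EEE status:" line as current ethtool versions print it; the helper name eee_status is made up for this example:

```shell
#!/bin/sh
# Print only the value of the "EEE status:" line from
# `ethtool --show-eee` output read on stdin.
eee_status() {
    sed -n 's/^[[:space:]]*EEE status:[[:space:]]*//p'
}

# Usage on the host (needs root and the real interface):
#   ethtool --show-eee enp0s31f6 | eee_status
```

Run it once before and once after the --set-eee command to compare the two states.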

To make this persistent, add it as a post-up rule in /etc/network/interfaces under vmbr0:

post-up ethtool --set-eee enp0s31f6 eee off
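For orientation, the bridge stanza could then look roughly like this. The address and gateway are placeholders from the documentation range and the bridge options are the usual Proxmox defaults, not values taken from this thread; only the post-up line is the actual change.

```
iface enp0s31f6 inet manual

auto vmbr0
iface vmbr0 inet static
        # placeholder address and gateway, replace with your own
        address 192.0.2.10/24
        gateway 192.0.2.1
        bridge-ports enp0s31f6
        bridge-stp off
        bridge-fd 0
        # turn EEE off on the physical port each time the bridge comes up
        post-up ethtool --set-eee enp0s31f6 eee off
```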

Second, if EEE is already off or that does not help, create /etc/modprobe.d/e1000e.conf with:

options e1000e SmartPowerDownEnable=0

Then run: update-initramfs -u -k all (and reboot so the driver is loaded with the new option)

To rule out interrupt coalescing as the cause, also try: ethtool -C enp0s31f6 rx-usecs 0 (as far as I know e1000e only exposes rx-usecs, and a tx-usecs argument may be rejected as unsupported)

A few useful commands to share the full picture if the problem continues: pveversion -v, uname -r, ethtool -i enp0s31f6 (shows driver and firmware versions), and dmesg | grep -E "e1000e|enp0s31f6" from around the time it hangs.
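If it helps, those commands can be wrapped into one small script that writes everything into a single file to attach here. This is only a sketch: the function name and the report path are made up, and a command that is missing or fails is noted in the report instead of aborting.

```shell
#!/bin/sh
# Collect the diagnostics listed above into one report file.
# Arguments: interface name, output file.
collect_nic_report() {
    iface="$1"
    out="$2"
    {
        for cmd in "pveversion -v" "uname -r" "ethtool -i $iface"; do
            echo "### $cmd"
            # Intentionally unquoted so the string splits into command + args.
            $cmd 2>&1 || echo "(command failed or not available)"
            echo
        done
        echo "### dmesg | grep -E \"e1000e|$iface\""
        dmesg 2>&1 | grep -E "e1000e|$iface" || echo "(no matching kernel messages)"
    } > "$out"
}

# Usage on the host, as root, ideally right after a hang:
#   collect_nic_report enp0s31f6 /root/nic-report.txt
```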

The fact that switching vmbr0 to another NIC and rebooting still does not help suggests the issue might be specifically on that physical port or the I219 chip itself - what NIC model is it?
 
I will work through all of your suggestions. Since the error only shows up after several days, I will report back in a few days on whether they helped.