e1000e unexpected adapter resets

surly · May 18, 2021

Hello:

I have been experiencing this for ~1.5 years through several versions of proxmox. Workarounds work, but I'm inquiring about a fix.

The system is an Intel NUC8i5BEH latest BIOS "BECFL357.86A.0087.2020.1209.1115 12/09/2020".
The host OS is currently up to date, proxmox-ve 6.4-1.
I have loaded optional kernel 5.11.17-1-pve at the recommendation of some because 5.11 solved their Ethernet problems.
The onboard NIC is I219-V

Code:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (6) I219-V (rev 30)
        Subsystem: Intel Corporation Ethernet Connection (6) I219-V
        Flags: bus master, fast devsel, latency 0, IRQ 137
        Memory at c0b00000 (32-bit, non-prefetchable) [size=128K]
        Capabilities: [c8] Power Management version 3
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Kernel driver in use: e1000e
        Kernel modules: e1000e

As mentioned, this has been present on multiple versions of PVE and multiple kernels. I am not positive what version of e1000e is included with 5.11.17-1-pve but it has changed since the half dozen kernel revisions preceding it (or at least it is not reporting a 3.2.x-y revision format):

Code:

# modinfo -k 5.11.17-1-pve  e1000e 
filename:       /lib/modules/5.11.17-1-pve/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko
license:        GPL v2
description:    Intel(R) PRO/1000 Network Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     8543CA62F65379D0D09CCD6

root@pve01:~# modinfo -k 5.4.114-1-pve  e1000e
filename:       /lib/modules/5.4.114-1-pve/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko
version:        3.2.6-k
license:        GPL v2
description:    Intel(R) PRO/1000 Network Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     A9698026892EE8F2061C993

The problem can be triggered by running iperf3 in an Ubuntu guest: "iperf3 -c -t 240 -P8" (server address omitted).
Running the same from the host OS doesn't seem to trigger the issue.
I have "VLAN aware" configured on the interface.
The workaround is to disable TSO with 'ethtool -K eno1 tso off'.
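
For anyone checking whether the workaround actually took effect: lowercase `ethtool -k` only lists the offload features, while uppercase `ethtool -K` changes them. The snippet below is a minimal sketch; the helper name `tso_state` is made up for illustration, and it assumes the usual `feature: state` line format that `ethtool -k` prints.

```shell
# Print the TSO state ("on", "off", possibly with a "[fixed]" suffix)
# from `ethtool -k <iface>` output supplied on stdin.
tso_state() {
    awk -F': *' '$1 == "tcp-segmentation-offload" { print $2; exit }'
}
```

Typical use would be `ethtool -k eno1 | tso_state` before and after running `ethtool -K eno1 tso off`.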

I did not have this problem when running Ubuntu 18.04 LTS on the same hardware.

When the problem is triggered, these messages are logged:

Code:

[Tue May 18 15:14:19 2021] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                             TDH                  <2e>
                             TDT                  <57>
                             next_to_use          <57>
                             next_to_clean        <2d>
                           buffer_info[next_to_clean]:
                             time_stamp           <10524cc6e>
                             next_to_watch        <2e>
                             jiffies              <10524ce60>
                             next_to_watch.status <0>
                           MAC Status             <40080083>
                           PHY Status             <796d>
                           PHY 1000BASE-T Status  <3800>
                           PHY Extended Status    <3000>
                           PCI Status             <10>
[Tue May 18 15:14:21 2021] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                             TDH                  <2e>
                             TDT                  <57>
                             next_to_use          <57>
                             next_to_clean        <2d>
                           buffer_info[next_to_clean]:
                             time_stamp           <10524cc6e>
                             next_to_watch        <2e>
                             jiffies              <10524d059>
                             next_to_watch.status <0>
                           MAC Status             <40080083>
                           PHY Status             <796d>
                           PHY 1000BASE-T Status  <3800>
                           PHY Extended Status    <3000>
                           PCI Status             <10>
[Tue May 18 15:14:23 2021] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                             TDH                  <2e>
                             TDT                  <57>
                             next_to_use          <57>
                             next_to_clean        <2d>
                           buffer_info[next_to_clean]:
                             time_stamp           <10524cc6e>
                             next_to_watch        <2e>
                             jiffies              <10524d248>
                             next_to_watch.status <0>
                           MAC Status             <40080083>
                           PHY Status             <796d>
                           PHY 1000BASE-T Status  <3800>
                           PHY Extended Status    <3000>
                           PCI Status             <10>
[Tue May 18 15:14:25 2021] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                             TDH                  <2e>
                             TDT                  <57>
                             next_to_use          <57>
                             next_to_clean        <2d>
                           buffer_info[next_to_clean]:
                             time_stamp           <10524cc6e>
                             next_to_watch        <2e>
                             jiffies              <10524d440>
                             next_to_watch.status <0>
                           MAC Status             <40080083>
                           PHY Status             <796d>
                           PHY 1000BASE-T Status  <3800>
                           PHY Extended Status    <3000>
                           PCI Status             <10>
[Tue May 18 15:14:26 2021] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
[Tue May 18 15:14:27 2021] vmbr0: port 1(eno1) entered disabled state
[Tue May 18 15:14:32 2021] e1000e 0000:00:1f.6 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Tue May 18 15:14:32 2021] vmbr0: port 1(eno1) entered blocking state
[Tue May 18 15:14:32 2021] vmbr0: port 1(eno1) entered forwarding state
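
A note on reading these reports: in all four, time_stamp stays at 10524cc6e and TDH stays at 2e while jiffies keeps advancing, so the same transmit descriptor appears to be stuck the whole time. The sketch below is a hypothetical pair of shell helpers (the names `stall_jiffies` and `count_events` are illustrative, not part of any tool) for working out how long that descriptor had been pending and for counting hangs and resets in a saved log.

```shell
# Jiffies elapsed between two hex counters copied from a hang report.
# $1 = time_stamp, $2 = jiffies, both hex without the 0x prefix.
stall_jiffies() {
    echo $(( 0x$2 - 0x$1 ))
}

# Count hang reports and adapter resets in kernel log text read from stdin.
count_events() {
    awk '/Detected Hardware Unit Hang/ { hangs++ }
         /Reset adapter unexpectedly/  { resets++ }
         END { printf "hangs=%d resets=%d\n", hangs, resets }'
}
```

For the first and last reports above, `stall_jiffies 10524cc6e 10524ce60` gives 498 and `stall_jiffies 10524cc6e 10524d440` gives 2002. With the reports logged about two seconds apart, that would be consistent with a 250 Hz tick, though the kernel's HZ value is an assumption here and is not stated anywhere in the thread. `dmesg | count_events` summarises a live log.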

Is this a bug in e1000e? Is it fixed in the "current" revision of e1000e, but PVE does not include that revision at this time, or is it still broken in the current revision? The "PC vendor" is Intel itself; if this is Intel's problem, how do I convince them of that?

Thanks

spirit · May 22, 2021

This is a known bug with the Intel NIC (I219-V chipset) in NUCs, and has been for years (just search the forum for "intel nuc reset").

You need to try disabling the offloading features:

apt install -y ethtool
ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

fpausp · May 22, 2021

To make the ethtool changes permanent, you can edit /etc/network/interfaces like this:

Code:

auto vmbr0
iface vmbr0 inet static
        address 192.168.xxx.xxx/24
        gateway 192.168.xxx.xxx
        bridge-ports enp5s0
        bridge-stp off
        bridge-fd 0
        pre-up /sbin/ethtool --offload vmbr0 gso off tso off sg off gro off
        pre-up /sbin/ethtool --offload enp5s0 gso off tso off sg off gro off

surly · May 22, 2021

Thank you both. I am aware of the workarounds - TSO off seems to be enough to stabilize my installation. When I heard that 5.11.x fixed a lot of e1000e problems for various platforms I was hopeful that it was fixed without workarounds.

But - here's my catch. I experienced none of these problems on Ubuntu 18.04 LTS before switching to Proxmox, yet under Proxmox this reared its head within a day. [ Admittedly I would have to go bare metal with Ubuntu again to see whether it defaults to TSO off or something similar. ]
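
One way to test that bracketed suspicion would be to save the `ethtool -k` output on each system and compare the two. This is only a sketch: `offload_diff` is a hypothetical helper name, and the file names in the usage line are placeholders.

```shell
# Show the offload features that differ between two saved
# `ethtool -k <iface>` dumps ($1 and $2 are file paths).
offload_diff() {
    diff "$1" "$2" | grep '^[<>]' || echo "no differences"
}
```

For example, run `ethtool -k eno1 > ubuntu.txt` on the Ubuntu install and `ethtool -k eno1 > pve.txt` on Proxmox, then `offload_diff ubuntu.txt pve.txt`. Lines starting with `<` come from the first file and lines starting with `>` from the second.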

Is this really an I219-V "bug" that simply cannot be fixed? Or is it an e1000e bug that someone just needs to be convinced to fix? Or is it already fixed in current revisions of e1000e, and Proxmox is still using an older revision without the fix?

I'm no developer by any stretch, but it looks to me from SourceForge like the current stable e1000e is 3.8.7. The latest Proxmox 5.4 kernel includes 3.2.6, which I would estimate is from ~2015, again based on SourceForge. The latest 5.11 kernel available with Proxmox removed the 3.x.y revision field, so I have no idea what revision it is, but dmesg logs "[Fri May 14 15:16:17 2021] e1000e: Copyright(c) 1999 - 2015 Intel Corporation", so I'm going to guess it's 6 years old too.
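
Since the 5.11 module no longer carries a version field, a small filter over `modinfo` output makes the comparison between kernels explicit. This is a sketch only; `driver_version` is a made-up helper name, and a missing field says nothing by itself about how old the driver code is.

```shell
# Print the "version:" field from `modinfo` output on stdin,
# or "not reported" when the module has no such field.
driver_version() {
    awk -F': *' '$1 == "version" { print $2; found = 1; exit }
                 END { if (!found) print "not reported" }'
}
```

With the two kernels from this thread, `modinfo -k 5.4.114-1-pve e1000e | driver_version` and `modinfo -k 5.11.17-1-pve e1000e | driver_version` can be run side by side.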

So - if there's 6 years of patching and development to e1000e since the latest included with proxmox, maybe the problem was actually fixed years ago?

spirit · May 23, 2021

I think this is really a hardware bug in the I219-V. I saw a workaround patch for UDP in the kernel some years ago:
https://github.com/torvalds/linux/commit/b10effb92e272051dd1ec0d7be56bf9ca85ab927

And almost all reported bugs are with these chipsets or the NUC platform.

surly · May 23, 2021

Well, nuts... Thanks for your perspectives...

EDIT: I should also point out - when I was doing some traffic tests using iperf3 which could trigger the issue, it only happened when the load was to/from a VM, not the proxmox host itself.

Not sure if that changes any impressions of what's going on.
