e1000 driver hang

n1nj4888 · Mar 20, 2020

I'm still getting the same "Detected Hardware Unit Hang" errors sporadically when using PVE kernel 5.4.

Code:

Mar 19 20:11:15 pve-host1.local kernel: [30377.339967] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:

I recall there was previously some advice about setting: ethtool -K <ADAPTER> gso off gro off tso off ... (and maybe even) ... tx off rx off

Can anyone who has one of these adapters and has stopped the "Detected Hardware Unit Hang" errors through the ethtool feature settings:

Confirm what the exact ethtool command should be?
Confirm how this should be applied as a "post up" (in /etc/network/interfaces?) so that the workaround is applied at each adapter reset/reboot?
Confirm whether they have seen any performance degradation from the above workaround?

Thanks!

Stoiko Ivanov · Mar 20, 2020

last user in the forum with that problem fixed it with:

Code:

ethtool -K <device name> gso off gro off tso off tx off rx off

see:
https://forum.proxmox.com/threads/i-have-a-issue-on-dedicated-lock.67157/#post-301787

n1nj4888 said:
Confirm how this should be applied as a "post up" (in /etc/network/interfaces?) so that the workaround is applied at each adapter reset/reboot?

adding a post-up line to the bridge config should work.

I hope this helps!

tssge · Mar 20, 2020

This has been fixed by patch https://patchwork.ozlabs.org/patch/1211740/ in kernel version 5.4.18 https://lwn.net/Articles/811637/
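To check whether a host is already running a kernel at or above that version, a version comparison along these lines may help. This is only a sketch: the helper names `version_at_least` and `running_kernel` are made up for illustration, it assumes GNU `sort -V` is available, and distribution kernels (such as the PVE kernel) may carry or lack a given patch regardless of their version number, so treat the result as a hint only.

```shell
# Succeeds if version $1 is at least version $2 (relies on GNU `sort -V`).
version_at_least() {
    # The smaller version sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix, e.g. "5.4.24-1-pve" -> "5.4.24".
running_kernel() {
    uname -r | cut -d- -f1
}

# Usage:
#   if version_at_least "$(running_kernel)" 5.4.18; then
#       echo "kernel version is 5.4.18 or newer"
#   fi
```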

n1nj4888 · Mar 23, 2020

Stoiko Ivanov said:
adding a post-up line to the bridge config should work.

I hope this helps!

Thanks @Stoiko Ivanov - Could you let me know how and where I make this "post-up line"?

Stoiko Ivanov · Mar 23, 2020

that's described quite well online - e.g. in the debian wiki:
https://wiki.debian.org/NetworkConfiguration
(just add a 'post-up ethtool ....' line under the other configs for that interface, like gateway, address, ...)

I hope this helps!

rewen · Mar 23, 2020

I, too, can confirm that ethtool -K eno1 tso off gso off mitigates the issue for me.

I have been having this issue for months now and did not realize it. I assumed it was a connection issue with the ISP or host (neither of which would even investigate). Turns out it was being logged in syslog the whole time and I was too stubborn to take a look.

The connection always came back on its own in my case, but still caused massive headaches. I'm very happy that it's mostly resolved at the moment.

That being said, there's no definition for eno1 in the /etc/network/interfaces file. So I added the post-up to vmbr0 instead, which seems like it would still work since presumably both come up at the same time?

Code:

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback


# vmbr0: Bridging. Make sure to use only MAC addresses that were assigned to you.
auto vmbr0
iface vmbr0 inet static
        address 1.2.3.4/24
        gateway 1.2.3.254
        bridge_ports eno1
        bridge_stp off
        bridge_fd 0
        post-up ethtool -K eno1 tso off gso off

Is there a better way I should have done this?

Apollon77 · Mar 23, 2020

I have a one-liner for eno1 in my file and added it to that ... no idea if that's correct :-(
Maybe someone can advise.

Code:

iface eno1 inet manual
        post-up /sbin/ethtool -K eno1 tso off gso off

ambyjkl · Mar 23, 2020

I have this issue on Linux 5.5.9 with Intel Corporation Ethernet Connection (7) I219-LM adapter. I have not yet tried the mitigation through `ethtool -K eno1 tso off gso off`, but someone mentioned this was supposed to have been fixed on 5.4.18 and I'm running a newer kernel. I would like to not use the mitigation since I've read it severely affects network performance, which is key for my application. Does anyone have any suggestions? Thanks!

spirit · Mar 24, 2020

ambyjkl said:
I have this issue on Linux 5.5.9 with Intel Corporation Ethernet Connection (7) I219-LM adapter. I have not yet tried the mitigation through `ethtool -K eno1 tso off gso off`, but someone mentioned this was supposed to have been fixed on 5.4.18 and I'm running a newer kernel. I would like to not use the mitigation since I've read it severely affects network performance, which is key for my application. Does anyone have any suggestions? Thanks!

These are two different bugs.

The ethtool workaround is for an old bug on some chipsets, which Intel can't fix (or doesn't want to fix).

The fix in kernel 5.4/5.5 is for a bug introduced in kernel 5.0/5.1.

So please try ethtool too.

spirit · Mar 24, 2020

BTW, I think that disabling only "tso" should be enough, judging from the various bug reports on bugzilla.kernel.org.

tssge · Mar 25, 2020

To those who are unable to fix this with ethtool: I realized that NICs also have a VLAN offload feature, and if you have VLANs on your host, this offload can cause the issue as well.

To disable all offloading on the NIC, the following command can be used:

Bash:

ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

This resolved the issue for me, even though the ethtool settings given earlier in this thread didn't work. It's worth trying, at least.
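Not every feature in a list like that can necessarily be changed on a given NIC: `ethtool -k` (lowercase k) marks features the driver does not allow changing with `[fixed]`. The sketch below lists those, so it is clearer which `off` settings can actually take effect. It assumes the usual `name: state [fixed]` output format, and the function name `fixed_features` is hypothetical.

```shell
# List features that `ethtool -k <dev>` reports as "[fixed]" (not changeable).
# Reads the ethtool output on stdin and prints one "name=state" line per feature.
fixed_features() {
    awk -F': *' '
        /\[fixed\]/ {
            key = $1
            sub(/^[ \t]+/, "", key)      # sub-features are indented
            split($2, state, " ")        # "off [fixed]" -> "off"
            print key "=" state[1]
        }
    '
}

# Usage (assumes the NIC is called eno1, as in this thread):
#   ethtool -k eno1 | fixed_features
```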

n1nj4888 · Mar 26, 2020

Apollon77 said:
I have a one liner with eno1 in my file and added it to that ... no idea if correct :-(
Maybe someone can advise.

Code:

iface eno1 inet manual
        post-up /sbin/ethtool -K eno1 tso off gso off

After installing ethtool on the node ...

Code:

apt install ethtool

... I tried this way first (putting the post-up under eno1) and rebooted. Checked whether the post-up had worked with "ethtool -k eno1 | grep offload" and I could see that it had not worked (tso was still enabled)... By placing the post-up line under both the "vmbr0" config (that eno1 is a bridge port of) and the "eno1" config, I could see that the config was set as expected after a reboot...

Code:

iface eno1 inet manual
        post-up ethtool -K eno1 tso off gso off

auto vmbr0
iface vmbr0 inet static
        address X.X.X.X/Y
        gateway X.X.X.X
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        post-up ethtool -K eno1 tso off gso off
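The check described above (`ethtool -k eno1 | grep offload`) can be narrowed to a single feature so the result is easy to read after a reboot. A minimal sketch, assuming the long feature names that `ethtool -k` prints (for example `tcp-segmentation-offload` for tso); the helper name `feature_state` is made up for illustration.

```shell
# Print the state ("on" or "off") of one feature from `ethtool -k` output on stdin.
feature_state() {
    awk -v name="$1" -F': *' '
        { key = $1; sub(/^[ \t]+/, "", key) }    # sub-features are indented
        key == name {
            split($2, state, " ")                # drop a trailing "[fixed]" marker
            print state[1]
            exit
        }
    '
}

# Usage (assumes the NIC is called eno1):
#   ethtool -k eno1 | feature_state tcp-segmentation-offload
#   ethtool -k eno1 | feature_state generic-segmentation-offload
```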

I'm not yet sure whether just "tso off gso off" is required or whether the full post-up line "ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off" is required, but I thought I'd start with the minimal set first and check whether the hardware hang still occurs on eno1...

Apollon77 · Mar 26, 2020

I also had the feeling that it did not work in the first place, but I wasn't sure how to verify it correctly. With this info I will also place it in both.

tssge · Mar 26, 2020

n1nj4888 said:
I'm not yet sure whether just "tso off gso off" is required or whether the full post-up line "ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off" is required, but I thought I'd start with the minimal set first and check whether the hardware hang still occurs on eno1...

Yeah, most probably only certain offloads need to be disabled and the others can be left on. However, I don't want to debug it further on a production server myself, and the CPU load increase from having no offloading at all, even when fully utilizing 1 Gbps, is negligible (at least on my servers).
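For anyone who does want to narrow it down, one approach is to switch offloads off in stages and watch the kernel log between stages. The following is a rough sketch using the short feature names from this thread; `offloads_off` and the `DRY_RUN` variable are hypothetical helpers, not part of ethtool.

```shell
# Build one `ethtool -K` call that switches the given features off.
# With DRY_RUN=1 the command is only printed, not executed.
offloads_off() (
    dev=$1
    shift
    args=""
    for feature in "$@"; do
        args="$args $feature off"
    done
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "ethtool -K $dev$args"
    else
        # Word splitting of $args is intended here.
        ethtool -K "$dev" $args
    fi
)

# Usage (assumes the NIC is called eno1), one stage at a time:
#   offloads_off eno1 tso
#   offloads_off eno1 tso gso
#   offloads_off eno1 gso gro tso tx rx rxvlan txvlan sg
```

Starting with the smallest set keeps as much offloading enabled as possible; each stage only adds features if the hang still shows up in the log.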

spirit · Mar 26, 2020

n1nj4888 said:
I tried this way first (putting the post-up under eno1) and rebooted. Checked whether the post-up had worked with "ethtool -k eno1 | grep offload" and I could see that it had not worked (tso was still enabled)... By placing the post-up line under both the "vmbr0" config (that eno1 is a bridge port of) and the "eno1" config, I could see that the config was set as expected after a reboot...

Code:

iface eno1 inet manual
        post-up ethtool -K eno1 tso off gso off

auto vmbr0
iface vmbr0 inet static
        address X.X.X.X/Y
        gateway X.X.X.X
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        post-up ethtool -K eno1 tso off gso off

As you don't have "auto eno1", ifupdown1 doesn't execute the post-up in the eno1 section.

eno1 is simply brought up ("ip link set eno1 up") by "bridge-ports eno1"; the post-up in vmbr0 then executes after both vmbr0 and eno1 are up.
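Based on that explanation, a quick way to see which stanza's post-up will actually run is to check which interfaces have an "auto" (or "allow-hotplug") line. A small sketch under that assumption; the function name `is_auto_or_hotplug` is made up, and it does not follow `source` includes in the interfaces file.

```shell
# Succeeds if the interfaces(5) file $1 lists interface $2 on an
# "auto" or "allow-hotplug" line.
is_auto_or_hotplug() {
    awk -v ifc="$2" '
        $1 == "auto" || $1 == "allow-hotplug" {
            for (i = 2; i <= NF; i++) if ($i == ifc) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage:
#   if is_auto_or_hotplug /etc/network/interfaces eno1; then
#       echo "post-up under eno1 should run"
#   else
#       echo "put the post-up under the bridge (e.g. vmbr0) instead"
#   fi
```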

Kribbstar · Mar 29, 2020

I think I have the very same problem as everyone else here. Under heavy network load the NIC seems to go down and I can't reach Proxmox or any VMs through SSH or the web GUI. The network load comes primarily from one of the VMs, which runs SABnzbd.

This is my log output, which repeats over and over:

Code:

Mar 14 14:17:10 yggdrasil kernel: [6045453.108633] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   TDH                  <0>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   TDT                  <1>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   next_to_use          <1>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   next_to_clean        <0>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633] buffer_info[next_to_clean]:
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   time_stamp           <15a14b4cb>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   next_to_watch        <0>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   jiffies              <15a14b8b8>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633]   next_to_watch.status <0>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633] MAC Status             <40080083>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633] PHY Status             <796d>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633] PHY 1000BASE-T Status  <3800>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633] PHY Extended Status    <3000>
Mar 14 14:17:10 yggdrasil kernel: [6045453.108633] PCI Status             <10>
Mar 14 14:17:11 yggdrasil kernel: [6045454.164399] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
Mar 14 14:17:11 yggdrasil kernel: [6045454.164439] vmbr0: port 1(eno1) entered disabled state
Mar 14 14:17:18 yggdrasil kernel: [6045461.092765] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Mar 14 14:17:18 yggdrasil kernel: [6045461.092810] vmbr0: port 1(eno1) entered blocking state
Mar 14 14:17:18 yggdrasil kernel: [6045461.092812] vmbr0: port 1(eno1) entered forwarding state

I'm running Proxmox on an Intel NUC 7i5BNK.

I'm going to try the suggested solution:

Bash:

ethtool -K <device name> gso off tso off

And see how it works, I really hope we find a permanent fix for this soon.
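To judge whether a workaround helps, it can be useful to count how often the hang message from the log above appears. A minimal sketch that assumes the message format shown in this thread (`<interface>: Detected Hardware Unit Hang`); the function name `count_unit_hangs` is made up for illustration.

```shell
# Count "Detected Hardware Unit Hang" events per interface in kernel log
# text read from stdin. Prints one "<count> <interface>" line per interface.
count_unit_hangs() {
    grep 'Detected Hardware Unit Hang' \
        | sed -n 's/.* \([^ ]*\): Detected Hardware Unit Hang.*/\1/p' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage:
#   journalctl -k | count_unit_hangs          # kernel messages from the journal
#   count_unit_hangs < /var/log/syslog        # or a plain syslog file
```

Comparing the count before and after applying the ethtool settings, over a similar period and load, gives a rough indication of whether the hangs have stopped.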

trottelvottel · Mar 30, 2020

Morning!

I have a problem with the eno1 port flapping on the switch, so I tried swapping the cable etc. It is an Intel card in a Lenovo M93 box.

So I tried the kernel from a post here... no effect.

Bash:

Linux 5.4.24-1-pve #1 SMP PVE 5.4.24-1 (Mon, 09 Mar 2020 12:59:46 +0100)

I am now testing with all offloads disabled:

Bash:

ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

Hope this will be fixed soon.

Regards

n1nj4888 · Apr 3, 2020

Just to report back here... Since adding "ethtool -K eno1 tso off gso off" as a post-up (about a week ago), I haven't had any further occurrences of the "Detected Hardware Unit Hang" issue... So it looks like only "tso off gso off" is required and not all the other parameters.

dynek · Apr 3, 2020

PVE Kernel 5.4.24-1 did not fix this for me, I'm also still using ethtool.

trottelvottel · Apr 3, 2020

Same here, only ethtool -K eno1 tso off gso off works.
