e1000 driver hang

tssge · Apr 3, 2020

n1nj4888 said:
Just to report back here... Since adding the "ethtool -K eno1 tso off gso off" to postup (about a week ago), I haven't had any further occurrences of the "Detected Hardware Unit Hang" issue... So it looks like only "tso off gso off" are required and not all the other parameters

Yes, for example disabling VLAN offload is required only if VLANs are used. It makes sense for the other features as well: if you're never getting UDP, a UDP offload won't trigger the bug in your card

But then again if you're not using VLANs, there's very little sense in keeping the offload on anyways.

Hyacin · May 16, 2020

Absolute weirdness for me. I have three identical NUC10i3FNs running identical versions of PVE. Setting tso off gso off rxvlan off txvlan off fixed one of the two that were acting up (I use VLAN aware). On the other one, which is still acting up, I've now extended that to rxvlan off txvlan off gso off gro off tso off tx off rx off. On the third, I haven't had any issues at all with all of it on!!

So, FYI to everyone coming across this in the future - it's not a one-size fits all silver bullet. Either turn it all off at the start if you want to nip it in the bud as quickly as possible (and suffer the full weight of doing all that without hardware assist), or go bit by bit until it stops!
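
For anyone taking the bit-by-bit route, a minimal sketch of how the steps could be generated is below. The interface name `eno1` and the order of the groups are assumptions (taken loosely from what people in this thread tried), not a tested recipe. It only prints the commands; nothing is changed on the NIC.

```shell
#!/bin/sh
# Sketch: print progressively larger "ethtool -K" commands so offloads can be
# disabled one group at a time. The grouping order is an assumption; adjust it.
IFACE="${IFACE:-eno1}"   # interface name is an assumption

offload_steps() {
    # $1 = interface. Each output line is one cumulative step,
    # least invasive first.
    acc=""
    for group in "tso off gso off" "gro off" "rxvlan off txvlan off" "tx off rx off" "sg off"; do
        acc="${acc:+$acc }$group"
        printf 'ethtool -K %s %s\n' "$1" "$acc"
    done
}

offload_steps "$IFACE"
```

Run one printed command by hand, watch the kernel log for a while, and only move on to the next line if the hang comes back.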

The most shocking thing to me though is that my 3rd box doesn't seem to need any of it to be turned off and it's plugging along just fine without a single hiccup.

tssge · May 16, 2020

Hyacin said:
The most shocking thing to me though is that my 3rd box doesn't seem to need any of it to be turned off and it's plugging along just fine without a single hiccup.

It seems that a certain kind of traffic triggers this issue. Once it's triggered, it'll continue to bother you until you restart the box. I have no idea what causes it specifically, but I am pretty sure that it takes a particular kind of packet to trigger it.

Some of my machines seem to stay on with no issues for quite some time, but eventually all of them develop this issue at some point.

spirit · May 16, 2020

Hyacin said:
The most shocking thing to me though is that my 3rd box doesn't seem to need any of it to be turned off and it's plugging along just fine without a single hiccup.

Same firmware version for the Intel NIC?

Hyacin · May 16, 2020

spirit said:
Same firmware version for the Intel NIC?

Appears so, yes - 0.6-4

fireon · May 19, 2020

So strange. Same problem here, only on one interface of a dual NIC, after years with Proxmox on there. Maybe the BIOS update from Supermicro will fix this.

Hyacin · May 22, 2020

I went out and bought a few $15 Realtek (chip) USB-C to GigE NICs, partly because I'd seen, I believe in this thread, that the performance of the onboard NIC takes a major hit with the offloading disabled, and partly because I think I'd like to keep my iSCSI traffic on its own link (very thankful I had an additional reason, lol):

Code:

root@NUC10i3FNH-3:/# ip link set vmbr0.10 down
root@NUC10i3FNH-3:/# scp -oBindAddress=172.24.0.14 testfile rob@172.24.0.55: # Onboard Intel NIC (with stuff disabled for stability)
rob@172.24.0.55's password:
testfile                                                                                                                                  100% 5000MB 110.2MB/s   00:45
root@NUC10i3FNH-3:/# ip link set vmbr0.10 up
root@NUC10i3FNH-3:/# ip link set vmbr1.10 down
root@NUC10i3FNH-3:/# scp -oBindAddress=172.24.0.12 testfile rob@172.24.0.55: # USB-C NIC
rob@172.24.0.55's password:
testfile                                                                                                                                  100% 5000MB 110.8MB/s   00:45
root@NUC10i3FNH-3:/#

Apparently the 10th gen i3 doesn't break a sweat doing the functions that were formerly offloaded to the NIC hardware.

Oh, and in the reverse direction -

Code:

root@NUC10i3FNH-3:/# scp -oBindAddress=172.24.0.12 rob@172.24.0.55:testfile . # USB-C NIC
rob@172.24.0.55's password:
testfile                                                                                                                                  100% 5000MB 108.0MB/s   00:46
root@NUC10i3FNH-3:/# ip link set vmbr1.10 up
root@NUC10i3FNH-3:/# ip link set vmbr0.10 down
root@NUC10i3FNH-3:/# scp -oBindAddress=172.24.0.14 rob@172.24.0.55:testfile . # Onboard Intel NIC (with stuff disabled for stability)
rob@172.24.0.55's password:
testfile                                                                                                                                  100% 5000MB 108.6MB/s   00:46
root@NUC10i3FNH-3:/#

fireon · May 22, 2020

fireon said:
So strange. Same problem here, only on one interface of a dual NIC, after years with Proxmox on there. Maybe the BIOS update from Supermicro will fix this.

Not really; the motherboard turned out to be damaged.

mlrtime · May 25, 2020

What are the current recommended settings, and how are you guys applying them? I currently have this in crontab:

@reboot /usr/sbin/ethtool -K vmbr0 gso off gro off tso off >> /tmp/ethtool.fix 2>&1
@reboot /usr/sbin/ethtool -K eno1 gso off gro off tso off >> /tmp/ethtool.fix 2>&1

jdruwe · Jun 15, 2020

mlrtime said:
What are the current recommended settings, and how are you guys applying them? I currently have this in crontab:

@reboot /usr/sbin/ethtool -K vmbr0 gso off gro off tso off >> /tmp/ethtool.fix 2>&1
@reboot /usr/sbin/ethtool -K eno1 gso off gro off tso off >> /tmp/ethtool.fix 2>&1

I am also eager to know as I just experienced this same issue on my nuc.

encoder17 · Jun 22, 2020

Hello, I am experiencing the same issue.
I've tried almost every ethtool command posted here. I also had the whole server swapped for an identical one with the same specs, and it did not help.
Can someone help me? I can even pay for a fix.

Hyacin · Jul 8, 2020

jdruwe said:
I am also eager to know as I just experienced this same issue on my nuc.

I just picked up some USB-C to GigE adapters for my NUCs. Trying everything mentioned in this thread didn't seem to be resolving it for me. I then ended up leaving both links up in a LACP LAG (not a supported configuration, but I've had no issues) with my VLAN interfaces bound to the bond, so if that onboard NIC has a hiccup, everything just carries on over the other link in the LAG.

takerukoushirou · Jul 9, 2020

I ran into the same issue on my BXNUC10i7FNH2: the network interface would occasionally hang and get reset by the watchdog.

Disabling tso and gso helped in my case, no hang/reset since applying the settings whenever the network interface goes up:

Bash:

iface eno1 inet manual
    post-up /usr/bin/logger -p debug -t ifup "Disabling tso and gso for eno1" && /usr/sbin/ethtool -K eno1 tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled tso and gso for eno1"

Adrianos712 · Jul 9, 2020

Hello, same issue here on a NUC8I7BEH.
I tried the workaround and it's not working, even with all offloads disabled:

Code:

ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

I don't have heavy network use, but when I start a Win 10 VM (with GPU passthrough) the problem occurs more often.

terentev · Jul 31, 2020

NUC8i7BEH

i have this

Code:

 e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                   TDH                  <81>
                   TDT                  <90>
                   next_to_use          <90>
                   next_to_clean        <80>
                 buffer_info[next_to_clean]:
                   time_stamp           <1190bfc40>
                   next_to_watch        <81>
                   jiffies              <1190c0260>
                   next_to_watch.status <0>
                 MAC Status             <40080083>
                 PHY Status             <796d>
                 PHY 1000BASE-T Status  <3c00>
                 PHY Extended Status    <3000>
                 PCI Status             <10>

rullywow · Aug 15, 2020

I have the same issue. Intel NUC7i7BNH with 1TB nvme SSD and 32GB of RAM.

e1000 driver hang error.

Kernel:
5.4.44-1-pve

00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (4) I219-V [8086:15d8] (rev 21)

From reading this thread, there are two suggested fixes:
1) Disable tso and gso using ethtool (with a potential hit to throughput)
2) Use USB ethernet adapter

Any insight on the best approach to fix this would be appreciated.

Thank you.

Hyacin · Aug 15, 2020

rullywow said:
2) Use USB ethernet adapter

I'm not having a good time with my USB-C GigE adapters. I'm not sure if it's the adapters (probably) or the NUCs, but the adapters keep disappearing entirely. I've got them paired with the onboard in a LACP LAG (not a supported configuration from what I read, but it works) as that's about all I can think of to do to get around the various issues I'm facing :-/ ... the whole situation is really disappointing for something as high-end and expensive as a NUC.

rullywow · Aug 15, 2020

Hyacin said:
I'm not having a good time with my USB-C GigE adapters. I'm not sure if it's the adapters (probably) or the NUCs, but the adapters keep disappearing entirely. I've got them paired with the onboard in a LACP LAG (not a supported configuration from what I read, but it works) as that's about all I can think of to do to get around the various issues I'm facing :-/ ... the whole situation is really disappointing for something as high-end and expensive as a NUC.

Agreed (and thanks for the insight). This shouldn't be an issue on a high-end NUC where Intel makes both the motherboard and the built-in NIC. It seems to only be a problem when pushing a lot of data through the NIC. In my case, Plex or SABNZBD or both.

I'm in the process of migrating this Proxmox install over to a USB-C to Gigabit Realtek adapter. Just got it to work by changing /etc/network/interfaces.

I would still like to find a solution that lets me use the built-in NIC, but I'm hoping this will at least stop the locking up. I think it's most frustrating that when it locks up you can't even SSH into it or reboot. It needs a manual power-off, which of course isn't the best idea for the hosts and VMs.

Erk · Sep 8, 2020

Has anyone done a watchdog script to reboot the server when this NIC hang occurs, or does it require a power cycle?
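
Not a confirmed fix, but a log-watching watchdog could be sketched roughly like this. It assumes the message text shown in the logs in this thread, that `journalctl` is available, and that a soft reboot is enough to recover, which nobody here has confirmed (some posters report needing a manual power-off). `THRESHOLD` and the five-minute window are made-up values.

```shell
#!/bin/sh
# Sketch of a log-watching watchdog for the e1000e hang message.
PATTERN="Detected Hardware Unit Hang"
THRESHOLD="${THRESHOLD:-5}"   # hypothetical: hang reports tolerated before acting

count_hangs() {
    # Count lines on stdin that contain the hang message.
    grep -c "$PATTERN"
}

should_reboot() {
    # $1 = number of hang reports. Succeeds once the threshold is reached.
    [ "$1" -ge "$THRESHOLD" ]
}

# Example wiring, left commented out on purpose. Run as root from cron or a
# systemd timer every few minutes:
#   hangs=$(journalctl -k --since "5 min ago" | count_hangs)
#   should_reboot "$hangs" && systemctl reboot
```

Counting several reports before acting is a deliberate choice here, since the driver resets the adapter on its own after a single hang and that sometimes recovers the link.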

chudak · Sep 10, 2020

I also see this problem

Code:

Sep 09 14:46:40 pve pvestatd[1206]: storage 'ISOs-SMB' is not online
Sep 09 14:46:42 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                              TDH                  <ab>
                              TDT                  <e4>
                              next_to_use          <e4>
                              next_to_clean        <aa>
                            buffer_info[next_to_clean]:
                              time_stamp           <10010824a>
                              next_to_watch        <ab>
                              jiffies              <100108a20>
                              next_to_watch.status <0>
                            MAC Status             <40080083>
                            PHY Status             <796d>
                            PHY 1000BASE-T Status  <3800>
                            PHY Extended Status    <3000>
                            PCI Status             <10>
Sep 09 14:46:42 pve kernel: e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly

cat /proc/version
Linux version 5.4.34-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.34-2 (Thu, 07 May 2020 10:02:02 +0200)

Code:

root@pve:~# ethtool eno1
Settings for eno1:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Supported pause frame use: No
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: on (auto)
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000007 (7)
                   drv probe link
    Link detected: yes

Can you please advise how to fix it?

If I run

ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

will the following restore the previous settings?

ethtool -K eno1 gso on gro on tso on tx on rx on rxvlan on txvlan on sg on

Thx
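
Settings changed with `ethtool -K` are not persistent, so a reboot returns the NIC to the driver defaults. Turning everything back `on` only matches the previous state if every one of those features was on to begin with. A safer route is to record the state first. The sketch below is one way to do that; it assumes the usual `ethtool -k` output layout (`name: on` or `name: off`, with sub-features indented), and the file path in the usage lines is hypothetical.

```shell
#!/bin/sh
# Sketch: turn "ethtool -k" output into arguments for "ethtool -K" so the
# current offload state can be saved and re-applied later.
restore_args() {
    awk -F': *' '
        BEGIN {
            # Long names printed by "ethtool -k" -> short names taken by "ethtool -K".
            short["rx-checksumming"] = "rx"
            short["tx-checksumming"] = "tx"
            short["scatter-gather"] = "sg"
            short["tcp-segmentation-offload"] = "tso"
            short["generic-segmentation-offload"] = "gso"
            short["generic-receive-offload"] = "gro"
            short["rx-vlan-offload"] = "rxvlan"
            short["tx-vlan-offload"] = "txvlan"
        }
        # Indented sub-feature lines never match: their name starts with a tab.
        ($1 in short) {
            split($2, state, " ")          # drop a trailing "[fixed]" marker
            printf "%s%s %s", sep, short[$1], state[1]
            sep = " "
        }
        END { print "" }
    '
}

# Usage, commented out because it needs root and a real interface:
#   ethtool -k eno1 | restore_args > /root/eno1.offload   # before changing anything
#   ethtool -K eno1 $(cat /root/eno1.offload)             # to put it back
```

Features the driver marks as `[fixed]` cannot be changed either way, so they may need to be dropped from the saved line by hand.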
