e1000 driver hang

lmm5247 · Mar 9, 2021

Commenting for visibility and to track solutions. I have a NUC7i3BNH running Proxmox and receive the "Detected Hardware Unit Hang" error when uploading large files inside a KVM to the cloud. It completely hangs the entire Proxmox machine and needs to be hard-rebooted.

Code:

Mar  8 23:01:23 proxmox01 kernel: [2280865.612614] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   TDH                  <cf>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   TDT                  <17>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   next_to_use          <17>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   next_to_clean        <cf>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614] buffer_info[next_to_clean]:
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   time_stamp           <121fbbae8>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   next_to_watch        <d0>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   jiffies              <121fbbd98>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614]   next_to_watch.status <0>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614] MAC Status             <40080083>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614] PHY Status             <796d>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614] PHY 1000BASE-T Status  <3c00>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614] PHY Extended Status    <3000>
Mar  8 23:01:23 proxmox01 kernel: [2280865.612614] PCI Status             <10>

Here is my NIC:

Code:

root@proxmox01:~# lspci -nnk | grep -A2 Ethernet
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (4) I219-V [8086:15d8] (rev 21)
        Subsystem: Intel Corporation Ethernet Connection (4) I219-V [8086:2068]
        Kernel driver in use: e1000e
        Kernel modules: e1000e

I set the following in "/etc/network/interfaces".

Code:

iface eno1 inet manual
    post-up /usr/bin/logger -p debug -t ifup "Disabling offload for eno1" && /sbin/ethtool -K $IFACE tso off gso off gro off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"

Here is the syslog snippet of that post-up config running.

Code:

Mar  9 10:55:42 proxmox01 ifup: Disabling offload for eno1
Mar  9 10:55:42 proxmox01 ifup: Disabled offload for eno1

Hopefully this will fix my issue. Will keep an eye out for a more permanent solution.

prx · May 15, 2021

I was having this issue with the interface being reset all the time under heavy load.

Here is the error:

Code:

[Fri May 14 23:55:54 2021] ------------[ cut here ]------------
[Fri May 14 23:55:54 2021] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
[Fri May 14 23:55:54 2021] WARNING: CPU: 12 PID: 0 at net/sched/sch_generic.c:448 dev_watchdog+0x264/0x270
[Fri May 14 23:55:54 2021] Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw softdog ip6table_mangle ip6table_filter ip6_tables xt_conntrack xt_tcpudp xt_nat xt_MASQUERADE iptable_nat nf_nat nfnetlink_log bpfilter nfnetlink intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass rapl intel_cstate input_leds serio_raw wmi_bmof intel_wmi_thunderbolt intel_pch_thermal acpi_pad mac_hid vhost_net vhost tap coretemp sunrpc autofs4 btrfs zstd_compress dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 multipath linear xt_comment xt_recent xt_connlimit nf_conncount xt_state nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_length xt_hl xt_tcpmss xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_dscp xt_multiport xt_limit iptable_mangle iptable_filter ip_tables x_tables bfq raid1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ahci xhci_pci e1000e i2c_i801
[Fri May 14 23:55:54 2021]  libahci xhci_hcd wmi video pinctrl_cannonlake pinctrl_intel
[Fri May 14 23:55:54 2021] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 5.4.114-1-pve #1
[Fri May 14 23:55:54 2021] Hardware name: Gigabyte Technology Co., Ltd. B360 HD3P-LM/B360HD3PLM-CF, BIOS F4 HZ 04/30/2019
[Fri May 14 23:55:54 2021] RIP: 0010:dev_watchdog+0x264/0x270
[Fri May 14 23:55:54 2021] Code: 48 85 c0 75 e6 eb a0 4c 89 ef c6 05 80 c8 ef 00 01 e8 20 b8 fa ff 89 d9 4c 89 ee 48 c7 c7 98 5c c3 92 48 89 c2 e8 c5 56 15 00 <0f> 0b eb 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
[Fri May 14 23:55:54 2021] RSP: 0018:ffff9decc03d8e58 EFLAGS: 00010282
[Fri May 14 23:55:54 2021] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000083f
[Fri May 14 23:55:54 2021] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[Fri May 14 23:55:54 2021] RBP: ffff9decc03d8e88 R08: 00000000000003a4 R09: ffffffff9339e768
[Fri May 14 23:55:54 2021] R10: 0000000000000774 R11: ffff9decc03d8cb0 R12: 0000000000000001
[Fri May 14 23:55:54 2021] R13: ffff925deb2a8000 R14: ffff925deb2a8480 R15: ffff925deb1ee880
[Fri May 14 23:55:54 2021] FS:  0000000000000000(0000) GS:ffff925dff300000(0000) knlGS:0000000000000000
[Fri May 14 23:55:54 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri May 14 23:55:54 2021] CR2: 00007f38443ebbc8 CR3: 0000000e649e6003 CR4: 00000000003606e0
[Fri May 14 23:55:54 2021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Fri May 14 23:55:54 2021] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Fri May 14 23:55:54 2021] Call Trace:
[Fri May 14 23:55:54 2021]  <IRQ>
[Fri May 14 23:55:54 2021]  ? pfifo_fast_enqueue+0x160/0x160
[Fri May 14 23:55:54 2021]  call_timer_fn+0x32/0x130
[Fri May 14 23:55:54 2021]  run_timer_softirq+0x1a5/0x430
[Fri May 14 23:55:54 2021]  ? ktime_get+0x3c/0xa0
[Fri May 14 23:55:54 2021]  ? lapic_next_deadline+0x2c/0x40
[Fri May 14 23:55:54 2021]  ? clockevents_program_event+0x93/0xf0
[Fri May 14 23:55:54 2021]  __do_softirq+0xdc/0x2d4
[Fri May 14 23:55:54 2021]  irq_exit+0xa9/0xb0
[Fri May 14 23:55:54 2021]  smp_apic_timer_interrupt+0x79/0x130
[Fri May 14 23:55:54 2021]  apic_timer_interrupt+0xf/0x20
[Fri May 14 23:55:54 2021]  </IRQ>
[Fri May 14 23:55:54 2021] RIP: 0010:cpuidle_enter_state+0xbd/0x450
[Fri May 14 23:55:54 2021] Code: ff e8 b7 79 88 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 ba 81 8e ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75 c8 48 8d
[Fri May 14 23:55:54 2021] RSP: 0018:ffff9decc0147e48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[Fri May 14 23:55:54 2021] RAX: ffff925dff32ae00 RBX: ffffffff92f57c40 RCX: 000000000000001f
[Fri May 14 23:55:54 2021] RDX: 000002c9a813f813 RSI: 00000000238e3d6b RDI: 0000000000000000
[Fri May 14 23:55:54 2021] RBP: ffff9decc0147e88 R08: 0000000000000002 R09: 000000000002a680
[Fri May 14 23:55:54 2021] R10: 00000a21d04c5df8 R11: ffff925dff329aa0 R12: ffffbdecbfd16f08
[Fri May 14 23:55:54 2021] R13: 0000000000000001 R14: ffffffff92f57cb8 R15: ffffffff92f57ca0
[Fri May 14 23:55:54 2021]  ? cpuidle_enter_state+0x99/0x450
[Fri May 14 23:55:54 2021]  cpuidle_enter+0x2e/0x40
[Fri May 14 23:55:54 2021]  call_cpuidle+0x23/0x40
[Fri May 14 23:55:54 2021]  do_idle+0x22c/0x270
[Fri May 14 23:55:54 2021]  cpu_startup_entry+0x1d/0x20
[Fri May 14 23:55:54 2021]  start_secondary+0x166/0x1c0
[Fri May 14 23:55:54 2021]  secondary_startup_64+0xa4/0xb0
[Fri May 14 23:55:54 2021] ---[ end trace ab9792688d4e93f4 ]---
[Fri May 14 23:55:54 2021] e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
[Fri May 14 23:56:00 2021] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Fri May 14 23:58:08 2021] e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
[Fri May 14 23:58:13 2021] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Sat May 15 00:08:17 2021] e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
[Sat May 15 00:08:22 2021] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[Sat May 15 00:08:33 2021] e1000e 0000:00:1f.6 eth0: Reset adapter unexpectedly
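To get a rough idea of how often this happens, the messages can be tallied. This is only a minimal sketch: it assumes kernel log text on stdin (for example from `dmesg -T` or `journalctl -k`), and the function name `count_e1000e_events` is made up for illustration.

```shell
#!/bin/sh
# Sketch: count e1000e hang/reset messages in kernel log text read from stdin.
# count_e1000e_events is a hypothetical helper name, not part of any tool.
count_e1000e_events() {
    awk '
        /Detected Hardware Unit Hang/      { hang++ }
        /Reset adapter unexpectedly/       { reset++ }
        /transmit queue [0-9]+ timed out/  { timeout++ }
        END { printf "hangs=%d resets=%d tx_timeouts=%d\n", hang, reset, timeout }'
}
```

Usage would be something like `dmesg -T | count_e1000e_events`.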

It happens on kernels:

* Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
* Linux version 5.11.7-1-pve (build@pve) (gcc (Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP PVE 5.11.7-1~bpo10 (Thu, 18 Mar 2021 16:17:24 +0100) ()

I have this NIC:

Code:

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)

but it may also happen on other related NICs.

I've tried setting various kernel options in /etc/default/grub, e.g.:

Code:

pcie_aspm=off

but it didn't help.
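When testing a kernel option like this, it is worth confirming that it really reached the running kernel. A small sketch, assuming the kernel command line is fed on stdin (normally from /proc/cmdline); `has_kernel_param` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed if the exact parameter $1 appears on the kernel command
# line read from stdin. has_kernel_param is a made-up name for illustration.
has_kernel_param() {
    # Put each parameter on its own line, then match the whole line literally.
    tr ' ' '\n' | grep -qxF -- "$1"
}
```

Usage would be something like `has_kernel_param pcie_aspm=off < /proc/cmdline && echo active`.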

The only workaround here is (replace eth0 with your interface name):

Code:

apt install -y ethtool
ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

To make this permanent, add this to your /etc/network/interfaces:

Code:

auto eth0
iface eth0 inet static
  offload-gso off
  offload-gro off
  offload-tso off
  offload-rx off
  offload-tx off
  offload-rxvlan off
  offload-txvlan off
  offload-sg off
  offload-ufo off
  offload-lro off
  address x.x.x.x
  netmask a.a.a.a
  gateway z.z.z.z

NOTE: disabling only tso or gso didn't help in my case; I had to disable all offloading!
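For anyone who wants to see which individual setting fails to apply, the same features can be turned off one at a time. This is a sketch only: the feature list is copied from the ethtool -K command above, while the function name `disable_offloads` and the `DRY_RUN` switch are hypothetical additions for illustration.

```shell
#!/bin/sh
# Features taken from the ethtool -K command earlier in this post.
FEATURES="gso gro tso tx rx rxvlan txvlan sg"

# Sketch: disable each offload feature separately on interface $1 and
# report any feature that could not be changed.
# DRY_RUN=1 (hypothetical switch) prints the commands instead of running them.
disable_offloads() {
    iface="$1"
    for feature in $FEATURES; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ethtool -K $iface $feature off"
        elif ! ethtool -K "$iface" "$feature" off; then
            echo "could not disable $feature on $iface" >&2
        fi
    done
}
```

Running it with `DRY_RUN=1` first shows the exact commands before anything is changed on the interface.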

chudak · May 21, 2021

I think I see the problem again and switched back to E1000 driver

cagnulein · Jun 16, 2021

thanks @prx it works for me too!

masgo · Oct 4, 2021

I believe I had the same or a similar error. Server with two NICs, the "I219-V" (eno1) is used for VMs with VLANs. When the VMs started the connection to proxmox was lost. Accessing via the other nic worked fine. The network of the VMs worked fine. Just the proxmox host could no longer use this connection and only a reboot would restore it. Disabling the offloading as mentioned above did not change anything.

I am using two VLANs (19 and 70). What I found out is that it was a specific VM which "crashed" it. Proxmox is using VLAN 19. The two VMs were using VLAN 70. The problems started the moment I migrated a third VM to the host. This VM had two NICs: one using VLAN 70 and the other using VLAN 19 but set to "disconnected". The moment this VM started, the proxmox host would lose the connection. The VM would boot and have a normal network connection! After removing the second NIC from the VM (a leftover from some past experiments) and rebooting the host, everything worked fine.

CochraneServer · Dec 21, 2021

chudak said:
I think I see the problem again and switched back to E1000 driver

I'm having this issue with an Intel I217-LM (rev 04) on-board NIC - how did you change the driver?

md127 · Mar 11, 2022

I can confirm that the workaround mentioned on this thread doesn't work.

n1nj4888 · Mar 11, 2022

md127 said:
I can confirm that the workaround mentioned on this thread doesn't work.

I’m still using the disable tso gso workaround I posted previously in the following post - since I did this (and restarted the node afterwards), I’ve never had any of the “Detected unit hardware hang” errors that I used to get, so suggest you try that?

Post in thread 'e1000 driver hang'
https://forum.proxmox.com/threads/e1000-driver-hang.58284/post-303366

masgo · Mar 11, 2022

n1nj4888 said:
I’m still using the disable tso gso workaround I posted previously in the following post - since I did this (and restarted the node afterwards), I’ve never had any of the “Detected unit hardware hang” errors that I used to get, so suggest you try that?

Post in thread 'e1000 driver hang'
https://forum.proxmox.com/threads/e1000-driver-hang.58284/post-303366

I tried this, and any other suggestion I could find, but the problems remained. They were less frequent with the tso gso workaround, but they still happened.
Last week I gave up and switched it for a different NIC (Intel E43709-004 - Based on the Intel 82576 Chipset) and the problems are gone.

prx · Mar 11, 2022

masgo said:
I tried this, and any other suggestion I could find, but the problems remained. They were less frequent with the tso gso workaround, but they still happened.
Last week I gave up and switched it for a different NIC (Intel E43709-004 - Based on the Intel 82576 Chipset) and the problems are gone.

Did you try my workaround - https://forum.proxmox.com/threads/e1000-driver-hang.58284/post-390709 ?
On some intel cards you need to disable all offload options.

md127 · Mar 11, 2022

prx said:
Did you try my workaround - https://forum.proxmox.com/threads/e1000-driver-hang.58284/post-390709 ?
On some intel cards you need to disable all offload options.

Thanks for responding. Yes I did, still getting the hang messages randomly: https://forum.proxmox.com/threads/intel-nuc-10-i219-v-e1000e-hardware-hang.106294/

Which kernel are you on?

prx · Mar 11, 2022

@md127 I have 5.4.140-1.pve on some boxes with intel NICs

md127 · Mar 11, 2022

prx said:
@md127 I have 5.4.140-1.pve on some boxes with intel NICs

@prx that could make a difference as mine are on 5.13.19-4-pve

masgo · Mar 11, 2022

It seems to me that the problem is VLAN related. I used the same NIC for quite some time without problems. The problems started when I added VLANs to my setup. But this might be a coincidence, since I also did a PVE update at the same time.

Anyways, thanks for your help. Since a used server NIC is quite cheap on ebay, I just went with that route.

md127 · Mar 12, 2022

I don’t have any VLAN setup. If any Proxmox staff is reading this perhaps they could acknowledge and document that Intel NICs are not supported.

spirit · Mar 12, 2022

md127 said:
I don’t have any VLAN setup. If any Proxmox staff is reading this perhaps they could acknowledge and document that Intel NICs are not supported.

The problem is mostly limited to the Intel chipset in the NIC used by NUCs. (The e1000 drivers work fine for a lot of other Intel chipsets.)

Hyacin · Mar 22, 2022

spirit said:
The problem is mostly limited to the Intel chipset in the NIC used by NUCs. (The e1000 drivers work fine for a lot of other Intel chipsets.)

FWIW I have the same LM218 or whatever it is chipset in my Dell Precision T5810 Xeon tower. I was shocked and amazed to discover that. I'll hopefully be replacing it with a PCI-E NIC with SR-IOV support soon though and then just turning it off entirely. Unreal that Intel would let an issue like this linger in Linux for so long with such a common and widely deployed network controller (I had the exact same one in my last high-end Asus consumer motherboard too!)

Nuke · Mar 25, 2022

Joining this topic.

Linux 5.15.19-2-pve #1 SMP PVE 5.15.19-3
Asus PRIME-H610M-A-D4
00:16.0 Communication controller [0780]: Intel Corporation Device [8086:7ae8] (rev 11)
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (17) I219-V [8086:1a1d] (rev 11)

prx said:
ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

This helped me, but it did not help when I set it permanently via /etc/network/interfaces.
After a reboot, the errors did not stop until I ran the ethtool command above manually.

Update:
OK, I created /usr/local/etc/ethtool.sh
with

Code:

#!/bin/bash
ethtool -K eno1 gso off gro on tso off tx on rx on rxvlan on txvlan on sg on && ethtool -K vmbr0 gso off gro on tso off tx on rx on rxvlan on txvlan on sg on

then create /etc/systemd/system/ethtool.service
with

Code:

[Unit]
Description=ethtool script

[Service]
WorkingDirectory=/usr/local/etc/
ExecStart=/usr/local/etc/ethtool.sh

[Install]
WantedBy=multi-user.target

then systemctl enable ethtool.service && systemctl start ethtool.service

pringlestuffs · Apr 7, 2022

Just wanted to confirm that this is a problem with the following card (in a common-as-dirt old higher end Dell desktop), on pve-kernel-5.13:

Code:

# lspci -nnk | grep -A2 Ethernet
00:19.0 Ethernet controller [0200]: Intel Corporation Ethernet Connection I217-LM [8086:153a] (rev 04)
        DeviceName:  Onboard LAN
        Subsystem: Dell Ethernet Connection I217-LM [1028:05a4]
        Kernel driver in use: e1000e
        Kernel modules: e1000e

Also, I had to use the 'ethtool' command in /etc/network/interfaces, as the following DID NOT have any effect on the setting as reported by 'ethtool -k'

Code:

iface eno1 inet manual
        offload-tso off  
        offload-gso off
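For checking the three main settings at a glance, here is a minimal sketch. It assumes `ethtool -k` output on stdin, and `offload_state` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: read `ethtool -k IFACE` output on stdin and print the state of the
# top-level TSO, GSO and GRO features. Indented sub-features are skipped
# because the pattern is anchored to the start of the line.
offload_state() {
    awk -F': ' '
        /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload):/ {
            split($2, state, " ")   # drop a trailing "[fixed]" marker if present
            print $1, state[1]
        }'
}
```

Usage would be something like `ethtool -k eno1 | offload_state`, run once before and once after bringing the interface up.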

Edit: What did work to disable the segmentation offloading for me was putting this in /etc/network/interfaces, as others have done. Note I didn't turn off GRO, only TSO and GSO, as my read of the situation is that turning off only TSO and GSO is the most conservative approach to try first. I haven't tested extensively to see whether the problem recurs.

Code:

iface eno1 inet manual
        post-up /usr/bin/logger -p debug -t ifup "Disabling segmentation offload for eno1" && /sbin/ethtool -K $IFACE tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"

markhaines · May 5, 2022

pringlestuffs said:
Edit: What did work to disable the segmentation offloading for me was putting this in /etc/network/interfaces, as others have done. Note I didn't turn off GRO, only TSO and GSO, as my read of the situation is that turning off only TSO and GSO is the most conservative approach to try first. I haven't tested extensively to see whether the problem recurs.

Code:

iface eno1 inet manual
        post-up /usr/bin/logger -p debug -t ifup "Disabling segmentation offload for eno1" && /sbin/ethtool -K $IFACE tso off gso off && /usr/bin/logger -p debug -t ifup "Disabled offload for eno1"

@pringlestuffs I'm seeing the same errors. Now that you've had the line above running for a month, are you happy it's fixed the issue?
