[SOLVED] Intel NIC e1000e hardware unit hang

md127 · Mar 5, 2022

This has been discussed many times. Is there a solution or workaround yet? The steps suggested in the other threads are not working for me.
I have the following in /etc/network/interfaces:

Code:

auto lo
iface lo inet loopback

iface eno1 inet manual
        offload-gso off
        offload-gro off
        offload-tso off
        offload-rx off
        offload-tx off
        offload-rxvlan off
        offload-txvlan off
        offload-sg off
        offload-ufo off
        offload-lro off
auto vmbr0
iface vmbr0 inet static
        address xxx
        gateway xxx
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

Syslog:

Code:

Mar 05 09:38:22 snuc2 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
  TDH                  <3b>
  TDT                  <59>
  next_to_use          <59>
  next_to_clean        <3a>
buffer_info[next_to_clean]:
  time_stamp           <1000bb386>
  next_to_watch        <3b>
  jiffies              <1000bb5e0>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
Mar 05 09:38:24 snuc2 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
  TDH                  <3b>
  TDT                  <59>
  next_to_use          <59>
  next_to_clean        <3a>
buffer_info[next_to_clean]:
  time_stamp           <1000bb386>
  next_to_watch        <3b>
  jiffies              <1000bb7d8>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
Mar 05 09:38:26 snuc2 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
  TDH                  <3b>
  TDT                  <59>
  next_to_use          <59>
  next_to_clean        <3a>
buffer_info[next_to_clean]:
  time_stamp           <1000bb386>
  next_to_watch        <3b>
  jiffies              <1000bb9c8>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
Mar 05 09:38:27 snuc2 kernel: ------------[ cut here ]------------
Mar 05 09:38:27 snuc2 kernel: NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
Mar 05 09:38:27 snuc2 kernel: WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x24c/0x250
Mar 05 09:38:27 snuc2 kernel: Modules linked in: tcp_diag inet_diag vfio_pci vfio_virqfd vfio_iommu_type1 vfio ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel mei_hdcp kvm i915 irqbypass crct10dif_pclmul ghash_clmulni_intel drm_kms_helper aesni_intel cec crypto_simd rc_core cryptd i2c_algo_bit fb_sys_fops mei_me syscopyarea rapl sysfillrect sysimgblt intel_cstate mei intel_wmi_thunderbolt intel_pch_thermal wmi_bmof ee1004 efi_pstore pcspkr mac_hid zfs(PO) acpi_pad zunicode(PO) acpi_tad zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp nfsd libiscsi_tcp libiscsi scsi_transport_iscsi auth_rpcgss nfs_acl overlay lockd grace drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq
Mar 05 09:38:27 snuc2 kernel:  hid_generic usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c uas usb_storage crc32_pclmul sdhci_pci i2c_i801 cqhci xhci_pci e1000e i2c_smbus xhci_pci_renesas sdhci thunderbolt intel_lpss_pci intel_lpss xhci_hcd idma64 wmi video pinctrl_cannonlake
Mar 05 09:38:27 snuc2 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O      5.13.19-4-pve #1
Mar 05 09:38:27 snuc2 kernel: Hardware name: Intel(R) Client Systems NUC10i7FNH/NUC10i7FNB, BIOS FNCML357.0055.2021.1202.1748 12/02/2021
Mar 05 09:38:27 snuc2 kernel: RIP: 0010:dev_watchdog+0x24c/0x250
Mar 05 09:38:27 snuc2 kernel: Code: ba 26 fd ff eb ab 4c 89 ff c6 05 65 fd 4f 01 01 e8 a9 ef f9 ff 44 89 e9 4c 89 fe 48 c7 c7 f8 d2 c8 8f 48 89 c2 e8 2c f9 19 00 <0f> 0b eb 8c 0f 1f 44 00 00 55 48 89 e5 41 57 49 89 d7 41 56 4d 89
Mar 05 09:38:27 snuc2 kernel: RSP: 0018:ffff9de5c0003e80 EFLAGS: 00010282
Mar 05 09:38:27 snuc2 kernel: RAX: 0000000000000000 RBX: ffff92b301fb4200 RCX: ffff92baa1a209c8
Mar 05 09:38:27 snuc2 kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff92baa1a209c0
Mar 05 09:38:27 snuc2 kernel: RBP: ffff9de5c0003eb0 R08: 0000000000000000 R09: ffff9de5c0003c60
Mar 05 09:38:27 snuc2 kernel: R10: ffff9de5c0003c58 R11: ffffffff90355428 R12: ffff92b301fb4280
Mar 05 09:38:27 snuc2 kernel: R13: 0000000000000000 R14: ffff92b310b30480 R15: ffff92b310b30000
Mar 05 09:38:27 snuc2 kernel: FS:  0000000000000000(0000) GS:ffff92baa1a00000(0000) knlGS:0000000000000000
Mar 05 09:38:27 snuc2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 05 09:38:27 snuc2 kernel: CR2: 00007fab75816670 CR3: 000000011ac98003 CR4: 00000000003726f0
Mar 05 09:38:27 snuc2 kernel: Call Trace:
Mar 05 09:38:27 snuc2 kernel:  <IRQ>
Mar 05 09:38:27 snuc2 kernel:  ? pfifo_fast_enqueue+0x150/0x150
Mar 05 09:38:27 snuc2 kernel:  call_timer_fn+0x2c/0x100
Mar 05 09:38:27 snuc2 kernel:  __run_timers.part.0+0x1d8/0x250
Mar 05 09:38:27 snuc2 kernel:  ? ktime_get+0x3b/0xa0
Mar 05 09:38:27 snuc2 kernel:  ? lapic_next_deadline+0x2c/0x40
Mar 05 09:38:27 snuc2 kernel:  ? clockevents_program_event+0x8f/0xe0
Mar 05 09:38:27 snuc2 kernel:  run_timer_softirq+0x2a/0x50
Mar 05 09:38:27 snuc2 kernel:  __do_softirq+0xcb/0x281
Mar 05 09:38:27 snuc2 kernel:  irq_exit_rcu+0xa2/0xd0
Mar 05 09:38:27 snuc2 kernel:  sysvec_apic_timer_interrupt+0x7c/0x90
Mar 05 09:38:27 snuc2 kernel:  </IRQ>
Mar 05 09:38:27 snuc2 kernel:  <TASK>
Mar 05 09:38:27 snuc2 kernel:  asm_sysvec_apic_timer_interrupt+0x12/0x20
Mar 05 09:38:27 snuc2 kernel: RIP: 0010:cpuidle_enter_state+0xcc/0x360
Mar 05 09:38:27 snuc2 kernel: Code: 3d 31 b2 0d 71 e8 44 74 7a ff 49 89 c6 0f 1f 44 00 00 31 ff e8 e5 7f 7a ff 80 7d d7 00 0f 85 01 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 0d 01 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89
Mar 05 09:38:27 snuc2 kernel: RSP: 0018:ffffffff90203de8 EFLAGS: 00000246
Mar 05 09:38:27 snuc2 kernel: RAX: ffff92baa1a34ec0 RBX: 0000000000000001 RCX: 000000000000001f
Mar 05 09:38:27 snuc2 kernel: RDX: 0000000000000000 RSI: 000000004f9a1f43 RDI: 0000000000000000
Mar 05 09:38:27 snuc2 kernel: RBP: ffffffff90203e20 R08: 00000311d4ab5902 R09: 0000000000000008
Mar 05 09:38:27 snuc2 kernel: R10: 0000000000000213 R11: ffffffff90450e00 R12: ffffbde5bf800500
Mar 05 09:38:27 snuc2 kernel: R13: ffffffff90450e00 R14: 00000311d4ab5902 R15: 0000000000000001
Mar 05 09:38:27 snuc2 kernel:  ? cpuidle_enter_state+0xbb/0x360
Mar 05 09:38:27 snuc2 kernel:  cpuidle_enter+0x2e/0x40
Mar 05 09:38:27 snuc2 kernel:  do_idle+0x1ff/0x2a0
Mar 05 09:38:27 snuc2 kernel:  cpu_startup_entry+0x20/0x30
Mar 05 09:38:27 snuc2 kernel:  rest_init+0xb8/0xba
Mar 05 09:38:27 snuc2 kernel:  arch_call_rest_init+0xe/0x1b
Mar 05 09:38:27 snuc2 kernel:  start_kernel+0x836/0x85c
Mar 05 09:38:27 snuc2 kernel:  x86_64_start_reservations+0x24/0x26
Mar 05 09:38:27 snuc2 kernel:  x86_64_start_kernel+0x8b/0x8f
Mar 05 09:38:27 snuc2 kernel:  secondary_startup_64_no_verify+0xc2/0xcb
Mar 05 09:38:27 snuc2 kernel:  </TASK>

oguz · Mar 14, 2022

hi,

md127 said:
This has been discussed many times. Is there a solution or workaround yet? The steps suggested in the other threads are not working for me.

where did you find those steps?

could you install ethtool and check your interface like: ethtool -k eno1 | grep offload to see if they're really disabled?

if not then please check the workaround in the post here [0] (the original linked thread is [1] but quite a long one to go through), using ethtool to set those options off.

if none of that works, post the requested information from my first post on thread [0]

[0]: https://forum.proxmox.com/threads/trap-error-on-e1000-network-adapter.105758
[1]: https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-4#post-302307
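For reference, a minimal sketch of that check as a small shell helper. The function name enabled_offloads is made up for illustration and is not from the linked posts. It only filters the text that ethtool -k prints and changes nothing on the NIC.

```shell
#!/bin/sh
# enabled_offloads (hypothetical helper name): read `ethtool -k <iface>` output
# on stdin and print the name of every offload feature that is still "on".
# Input lines look like "generic-receive-offload: on" or
# "large-receive-offload: off [fixed]".
enabled_offloads() {
    awk -F': ' '/offload/ && $2 ~ /^on/ {
        name = $1
        sub(/^[ \t]+/, "", name)   # some features are printed indented
        print name
    }'
}

# On the host from this thread one would run:
#   ethtool -k eno1 | enabled_offloads
# An empty result means no offload feature is reported as enabled.
```

If the list is not empty after boot, the settings in the interfaces file did not stick and the ethtool workaround is worth trying.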

md127 · Mar 14, 2022

oguz said:
could you install ethtool and check your interface like: ethtool -k eno1 | grep offload to see if they're really disabled?

@oguz That's a great tip: although I had them disabled in the interfaces file (see below), offloading was re-enabled automatically! Do we need to set up a cron job to periodically disable it?

Code:

auto lo
iface lo inet loopback

iface eno1 inet manual
        offload-gso off
        offload-gro off
        offload-tso off
        offload-rx off
        offload-tx off
        offload-rxvlan off
        offload-txvlan off
        offload-sg off
        offload-ufo off
        offload-lro off
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.158/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0

oguz · Mar 14, 2022

md127 said:
Do we need to set up a cron job to periodically disable this?

check the linked post [0], you can add a post-up directive in your network configuration file as described there (that will basically rerun the workaround when the interface is up)
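A minimal sketch of what such a stanza could look like, assuming the interface name eno1 and the feature list used elsewhere in this thread (gso, gro, tso, tx, rx). The exact wording in the linked post [0] may differ, so treat this as an illustration rather than the official recipe:

```
# assumption: re-apply the ethtool workaround every time eno1 comes up
iface eno1 inet manual
        post-up ethtool -K eno1 gso off gro off tso off tx off rx off
```

Since post-up runs each time the interface is brought up, no cron job should be needed for this approach.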

md127 · Mar 14, 2022

oguz said:
check the linked post [0], you can add a post-up directive in your network configuration file as described there (that will basically rerun the workaround when the interface is up)

Thanks! Apply them to the vmbr0 interface as well?
And reboot PVE?

oguz · Mar 14, 2022

md127 said:
Apply them to the vmbr0 interface as well?
And reboot PVE?

yes and yes. you can also install ifupdown2 and do ifreload -a instead of rebooting

md127 · Mar 14, 2022

Awesome, thanks again! I'll try it and mark this as "solved" if it works.

Last question (for the time being): only gso/tso off? What about the others? Right now I've switched off a lot more options in the interfaces file shared above, which gives me this:

Code:

root@snuc2:~# ethtool -k eno1|grep offload
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off
tx-vlan-offload: off
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
macsec-hw-offload: off [fixed]
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

oguz · Mar 15, 2022

md127 said:
Last question (for the time being): only gso/tso off? What about the others?

try the options as posted before: ethtool -K eno1 gso off gro off tso off tx off rx off

if it doesn't solve your issue please report
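As a sketch only, the same command can be wrapped in a small shell function so the feature list is easy to change. The name disable_offloads and the ETHTOOL override for dry runs are assumptions made for this example. The default feature list is the one from the command above.

```shell
#!/bin/sh
# disable_offloads (hypothetical helper name): turn off offload features on
# one interface. With no feature arguments it uses gso gro tso tx rx.
# ETHTOOL can be overridden (e.g. ETHTOOL=echo) to print the arguments
# instead of touching the NIC; that override is an assumption of this sketch.
disable_offloads() {
    if [ "$#" -lt 1 ]; then
        echo "usage: disable_offloads IFACE [FEATURE...]" >&2
        return 1
    fi
    iface=$1
    shift
    # Fall back to the default feature list when none is given
    [ "$#" -gt 0 ] || set -- gso gro tso tx rx
    args=""
    for feature in "$@"; do
        args="$args $feature off"
    done
    # $args is intentionally unquoted so it splits into separate words
    ${ETHTOOL:-ethtool} -K "$iface" $args
}

# Real use (as root):  disable_offloads eno1
# Dry run:             ETHTOOL=echo disable_offloads eno1
```

Extra feature names that ethtool -K accepts can be passed as arguments, for example disable_offloads eno1 gso gro tso tx rx rxvlan txvlan.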

md127 · Mar 19, 2022

@oguz I guess that's it! I don't see any more hardware hangs. Thanks very much for helping debug the issue and providing the workaround.
Is it possible to disable rx-vlan/tx-vlan offload in that script as well?

Code:

root@snuc2:~# ethtool -k eno1|grep offload
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off
tx-vlan-offload: off
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
macsec-hw-offload: off [fixed]
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
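One way to double-check that the hangs have stopped is to count the hang messages in the kernel log. This is a minimal sketch: the helper name count_hangs is made up for illustration, and it simply counts lines containing the message text seen in the syslog excerpts in this thread.

```shell
#!/bin/sh
# count_hangs (hypothetical helper name): count e1000e hang reports in
# kernel log text read from stdin.
count_hangs() {
    # grep -c prints 0 but exits non-zero when nothing matches; `|| true`
    # keeps the function's exit status clean in that case.
    grep -c 'Detected Hardware Unit Hang' || true
}

# On a systemd host, for the current boot:
#   journalctl -k -b | count_hangs
```

A count of 0 after some time under load suggests the workaround is holding.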

roycordero · Nov 11, 2023

thanks to this thread I solved my bug that had me sleepless for days.

RobGranger · Nov 18, 2023

Question (I know this is an old thread): has anyone been able to get a PCIe x1 card working in the Z4? I have tried three cards and none of them shows up in the dmesg log as a discovered device. I would like to get a working NIC instead of the buggy integrated ones.
Thanks!
Rob

jsalas424 · Sunday at 05:08

I'm on PVE 8.2.4 with kernel 6.8.8-2-pve and am still seeing this issue on my Intel NICs. Is running ethtool -K eno1 gso off gro off tso off tx off rx off from a post-up directive in the /etc/network/interfaces file still the prescribed fix? Is there any effort or intent to fix this permanently? I'm seeing this reliably when my NICs are under load.

Code:

root@Server:~# lspci | grep Ethernet
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

Code:

Jun 29 23:01:43 Server kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
TDH                  <b4>
TDT                  <e1>
next_to_use          <e1>
next_to_clean        <b3>
buffer_info[next_to_clean]:
time_stamp           <10fe37002>
next_to_watch        <b4>
jiffies              <10fe38fc0>
next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 29 23:01:43 Server kernel: e1000e 0000:00:19.0 eno1: NETDEV WATCHDOG: CPU: 3: transmit queue 0 timed out 8189 ms
Jun 29 23:01:43 Server kernel: e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
Jun 29 23:01:44 Server kernel: vmbr0: port 1(eno1) entered disabled state
Jun 29 23:01:47 Server kernel: e1000e 0000:00:19.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

Code:

/etc/network/interfaces file:

iface eno1 inet manual
    post-up ethtool -K eno1 tso off gso off
