e1000 driver hang

alatteri · Sep 23, 2019

In the past week we have been seeing random e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang failures across all our nodes, even on different hardware hosts. We have to reset the host to recover.

There are lots of references to this issue going back 5+ years. Was there a driver change with the latest updates? We've run this hardware for years without issue. Now, just this week, it's popping up all over.

Kernel Version Linux 5.0.21-2-pve #1 SMP PVE 5.0.21-3 (Thu, 05 Sep 2019 13:56:01 +0200)
PVE Manager Version pve-manager/6.0-7/2898402

Sep 22 20:03:08 vmhost03 kernel: [154458.471981] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] TDH <39>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] TDT <8f>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_use <8f>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_clean <39>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] buffer_info[next_to_clean]:
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] time_stamp <1024c0b53>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_watch <3a>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] jiffies <1024c11f0>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_watch.status <0>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] MAC Status <40080083>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] PHY Status <796d>

Sep 16 14:36:41 vmhost03 kernel: [67010.834277] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
Sep 16 14:36:41 vmhost03 kernel: [67010.834295] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x221/0x230
Sep 16 14:36:41 vmhost03 kernel: [67010.834295] Modules linked in: veth arc4 md4 cmac nls_utf8 cifs ccm fscache ebtable_filter ebtables ip_set ip6table_filter ip6_tables sctp iptable_filter bpfilter softdog nfnetlink_log nfnetlink intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nls_iso8859_1 aesni_intel zfs(PO) aes_x86_64 crypto_simd cryptd glue_helper zunicode(PO) zlua(PO) intel_cstate i915 kvmgt intel_rapl_perf snd_pcm vfio_mdev mdev vfio_iommu_type1 snd_timer vfio snd soundcore pcspkr kvm wmi_bmof irqbypass intel_wmi_thunderbolt drm_kms_helper drm intel_xhci_usb_role_switch i2c_algo_bit mei_me roles fb_sys_fops syscopyarea sysfillrect mei sysimgblt intel_pch_thermal acpi_pad mac_hid zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi sunrpc scsi_transport_iscsi ip_tables x_tables autofs4 xfs btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison
Sep 16 14:36:41 vmhost03 kernel: [67010.834327] dm_bufio libcrc32c i2c_i801 ahci e1000e libahci wmi video
Sep 16 14:36:41 vmhost03 kernel: [67010.834331] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P O 5.0.21-1-pve #1
Sep 16 14:36:41 vmhost03 kernel: [67010.834332] Hardware name: Intel Corporation NUC7i3BNK/NUC7i3BNB, BIOS BNKBL357.86A.0080.2019.0725.1139 07/25/2019
Sep 16 14:36:41 vmhost03 kernel: [67010.834334] RIP: 0010:dev_watchdog+0x221/0x230
Sep 16 14:36:41 vmhost03 kernel: [67010.834335] Code: 00 49 63 4e e0 eb 92 4c 89 ef c6 05 0b a2 ef 00 01 e8 f3 2a fc ff 89 d9 4c 89 ee 48 c7 c7 30 0a 1b ab 48 89 c2 e8 b1 d5 78 ff <0f> 0b eb c0 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
Sep 16 14:36:41 vmhost03 kernel: [67010.834336] RSP: 0018:ffff97b2deb03e68 EFLAGS: 00010286
Sep 16 14:36:41 vmhost03 kernel: [67010.834337] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
Sep 16 14:36:41 vmhost03 kernel: [67010.834338] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff97b2deb16440
Sep 16 14:36:41 vmhost03 kernel: [67010.834339] RBP: ffff97b2deb03e98 R08: 0000000000000001 R09: 00000000000003ca
Sep 16 14:36:41 vmhost03 kernel: [67010.834339] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
Sep 16 14:36:41 vmhost03 kernel: [67010.834340] R13: ffff97b2cf570000 R14: ffff97b2cf5704c0 R15: ffff97b2d01f9e80
Sep 16 14:36:41 vmhost03 kernel: [67010.834341] FS: 0000000000000000(0000) GS:ffff97b2deb00000(0000) knlGS:0000000000000000
Sep 16 14:36:41 vmhost03 kernel: [67010.834342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 14:36:41 vmhost03 kernel: [67010.834343] CR2: 000000000121a7e0 CR3: 00000006d3a0e001 CR4: 00000000003626e0
Sep 16 14:36:41 vmhost03 kernel: [67010.834343] Call Trace:
Sep 16 14:36:41 vmhost03 kernel: [67010.834345] <IRQ>
Sep 16 14:36:41 vmhost03 kernel: [67010.834348] ? pfifo_fast_enqueue+0x120/0x120
Sep 16 14:36:41 vmhost03 kernel: [67010.834351] call_timer_fn+0x30/0x130
Sep 16 14:36:41 vmhost03 kernel: [67010.834353] run_timer_softirq+0x3e4/0x420
Sep 16 14:36:41 vmhost03 kernel: [67010.834355] ? ktime_get+0x3c/0xa0
Sep 16 14:36:41 vmhost03 kernel: [67010.834357] ? lapic_next_deadline+0x26/0x30
Sep 16 14:36:41 vmhost03 kernel: [67010.834359] ? clockevents_program_event+0x93/0xf0
Sep 16 14:36:41 vmhost03 kernel: [67010.834362] __do_softirq+0xdc/0x2f3
Sep 16 14:36:41 vmhost03 kernel: [67010.834364] irq_exit+0xc0/0xd0
Sep 16 14:36:41 vmhost03 kernel: [67010.834366] smp_apic_timer_interrupt+0x79/0x140
Sep 16 14:36:41 vmhost03 kernel: [67010.834368] apic_timer_interrupt+0xf/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834368] </IRQ>
Sep 16 14:36:41 vmhost03 kernel: [67010.834370] RIP: 0010:cpuidle_enter_state+0xbd/0x450
Sep 16 14:36:41 vmhost03 kernel: [67010.834371] Code: ff e8 47 27 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a 57 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75 c8 48 8d
Sep 16 14:36:41 vmhost03 kernel: [67010.834372] RSP: 0018:ffffb350431e7e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Sep 16 14:36:41 vmhost03 kernel: [67010.834373] RAX: ffff97b2deb22d80 RBX: ffffffffab553d40 RCX: 000000000000001f
Sep 16 14:36:41 vmhost03 kernel: [67010.834374] RDX: 00003cf22cf8fd23 RSI: 0000000035555555 RDI: 0000000000000000
Sep 16 14:36:41 vmhost03 kernel: [67010.834375] RBP: ffffb350431e7ea0 R08: 0000000000000000 R09: 0000000000022640
Sep 16 14:36:41 vmhost03 kernel: [67010.834375] R10: 0000924bca543948 R11: ffff97b2deb21c04 R12: ffff97b2deb2cd00
Sep 16 14:36:41 vmhost03 kernel: [67010.834376] R13: 0000000000000006 R14: ffffffffab553f98 R15: ffffffffab553f80
Sep 16 14:36:41 vmhost03 kernel: [67010.834378] cpuidle_enter+0x17/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834381] call_cpuidle+0x23/0x40
Sep 16 14:36:41 vmhost03 kernel: [67010.834382] do_idle+0x23a/0x280
Sep 16 14:36:41 vmhost03 kernel: [67010.834384] cpu_startup_entry+0x1d/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834386] start_secondary+0x1ab/0x200
Sep 16 14:36:41 vmhost03 kernel: [67010.834388] secondary_startup_64+0xa4/0xb0
Sep 16 14:36:41 vmhost03 kernel: [67010.834390] ---[ end trace 25fa321422d7a98c ]---

n1nj4888 · Oct 15, 2019

I'm seeing this same issue on both my Intel NUC8i5BEH units (Intel® Ethernet Connection I219-V) - Both run PVE6 and produce the same set of "e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang" error messages as above in /var/log/syslog, sometimes in excess of 150 times a day...

I assume this is a driver issue (given it happens on both hardware units)?
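
To put a number on how often the hang occurs, the messages can be counted per day. This is a minimal sketch, assuming the traditional syslog timestamp format seen in this thread; the function name count_hangs_per_day is made up for illustration.

```shell
# Count "Detected Hardware Unit Hang" messages per day in a syslog-style file.
# Assumes lines start with "Mon DD HH:MM:SS", so fields 1 and 2 are month and day.
count_hangs_per_day() {
    logfile="${1:-/var/log/syslog}"
    grep -F 'Detected Hardware Unit Hang' "$logfile" |
        awk '{ count[$1 " " $2]++ } END { for (day in count) print day, count[day] }'
}

# Example (output order is not guaranteed, so sort it if needed):
#   count_hangs_per_day /var/log/syslog | sort
```

Each output line is a date followed by the number of hang reports logged on that date.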

aSpeX · Nov 8, 2019

Hey,
Same problem for me: e1000e, "eno1: Detected Hardware Unit Hang" with kernels 5.0.21-2-pve and 5.3.7-1-pve.
Sometimes the connection doesn't come back and I have to reboot the node.
The problem has appeared since I had to reinstall Proxmox VE; it worked without a problem on the old version.
Any solutions yet?

bogo22 · Nov 8, 2019

I got the same error, see my forum post.
It looks like many users have this problem, also on newer kernels (5.0+); see the bug tracker on kernel.org.
I built a new .ko file from Intel's out-of-tree e1000e driver v3.6.0 (DKMS or Intel source). Let's hope it solves the hangs. I will give feedback.

bogo22 · Nov 9, 2019

Still got hangs and also a call trace this time:

Code:

[50316.713031] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100becd68>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50318.729038] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100becf60>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50320.749008] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100bed159>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50322.760939] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100bed350>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50322.952657] ------------[ cut here ]------------
[50322.952659] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
[50322.952672] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:466 dev_watchdog+0x221/0x230
[50322.952673] Modules linked in: tcp_diag inet_diag binfmt_misc veth ebtable_filter ebtables ip_set ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_skl_ipc snd_soc_sst_ipc i915 snd_soc_sst_dsp kvmgt vfio_mdev mdev vfio_iommu_type1 snd_soc_acpi_intel_match vfio snd_soc_acpi kvm snd_soc_core aesni_intel irqbypass snd_compress ac97_bus aes_x86_64 snd_pcm_dmaengine crypto_simd wmi_bmof snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep drm_kms_helper cryptd glue_helper intel_cstate intel_rapl_perf drm pcspkr i2c_algo_bit mei_me intel_wmi_thunderbolt fb_sys_fops snd_pcm syscopyarea snd_timer sysfillrect snd sysimgblt soundcore mei intel_pch_thermal acpi_pad mac_hid acpi_tad vhost_net vhost tap ib_iser rdma_cm iw_cm
[50322.952693]  ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c uas usb_storage ahci e1000e i2c_i801 libahci wmi pinctrl_cannonlake video pinctrl_intel
[50322.952703] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O      5.0.21-4-pve #1
[50322.952704] Hardware name: Intel(R) Client Systems NUC8i3BEH/NUC8BEB, BIOS BECFL357.86A.0071.2019.0510.1505 05/10/2019
[50322.952705] RIP: 0010:dev_watchdog+0x221/0x230
[50322.952706] Code: 00 49 63 4e e0 eb 92 4c 89 ef c6 05 7d 40 ee 00 01 e8 03 22 fc ff 89 d9 4c 89 ee 48 c7 c7 e0 aa 1d a8 48 89 c2 e8 71 4f 77 ff <0f> 0b eb c0 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[50322.952707] RSP: 0018:ffff9ae8adb03e68 EFLAGS: 00010286
[50322.952707] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[50322.952708] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9ae8adb16440
[50322.952708] RBP: ffff9ae8adb03e98 R08: 0000000000000001 R09: 00000000000003cb
[50322.952709] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
[50322.952709] R13: ffff9ae8a192c000 R14: ffff9ae8a192c4c0 R15: ffff9ae8a1c75280
[50322.952710] FS:  0000000000000000(0000) GS:ffff9ae8adb00000(0000) knlGS:0000000000000000
[50322.952711] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50322.952711] CR2: 0000000001880000 CR3: 000000017560e003 CR4: 00000000003626e0
[50322.952711] Call Trace:
[50322.952713]  <IRQ>
[50322.952715]  ? pfifo_fast_enqueue+0x120/0x120
[50322.952717]  call_timer_fn+0x30/0x130
[50322.952718]  run_timer_softirq+0x3e4/0x420
[50322.952720]  ? ktime_get+0x40/0xa0
[50322.952721]  ? lapic_next_deadline+0x26/0x30
[50322.952723]  ? clockevents_program_event+0x93/0xf0
[50322.952724]  __do_softirq+0xdc/0x2f3
[50322.952726]  irq_exit+0xc0/0xd0
[50322.952727]  smp_apic_timer_interrupt+0x79/0x140
[50322.952728]  apic_timer_interrupt+0xf/0x20
[50322.952728]  </IRQ>
[50322.952730] RIP: 0010:cpuidle_enter_state+0xbd/0x450
[50322.952731] Code: ff e8 17 9d 85 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a d2 8b ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 89 cf 01 00 00 41 c7 44 24 08 00 00 00 00 48 83 c4 18
[50322.952731] RSP: 0018:ffffbddc41973e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[50322.952732] RAX: ffff9ae8adb221c0 RBX: ffffffffa8553e60 RCX: 000000000000001f
[50322.952732] RDX: 00002dc4b9b788ec RSI: 000000002aaaaaaa RDI: 0000000000000000
[50322.952733] RBP: ffffbddc41973ea0 R08: 0000000000000000 R09: 0000000000021a80
[50322.952733] R10: 0000895ded97c40b R11: ffff9ae8adb21044 R12: ffff9ae8adb2cd00
[50322.952734] R13: 0000000000000004 R14: ffffffffa8553ff8 R15: ffffffffa8553fe0
[50322.952735]  cpuidle_enter+0x17/0x20
[50322.952737]  call_cpuidle+0x23/0x40
[50322.952737]  do_idle+0x22c/0x270
[50322.952738]  cpu_startup_entry+0x1d/0x20
[50322.952740]  start_secondary+0x1ab/0x200
[50322.952741]  secondary_startup_64+0xa4/0xb0
[50322.952742] ---[ end trace 68ab007781b80a74 ]---
[50322.952754] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
[50322.952881] vmbr0: port 1(eno1) entered disabled state
[50326.795286] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[50326.795367] vmbr0: port 1(eno1) entered blocking state
[50326.795370] vmbr0: port 1(eno1) entered forwarding state


Tacid · Nov 15, 2019

bogo22 said:
Still got hangs and also a call trace this time:

Maybe you should try disabling TCP segmentation offload and generic segmentation offload:

Code:

ethtool -K <interface> tso off gso off

There is a known problem with Intel I218/I219 NICs and the e1000e driver: a buffer overrun while the I219 is processing DMA transactions. That problem was fixed in kernel 4.15: https://github.com/torvalds/linux/commit/b10effb92e272051dd1ec0d7be56bf9ca85ab927
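
For anyone trying this, a minimal sketch of checking the current state first and then applying the workaround. The function names are made up for this example, and eno1 is assumed only because it is the interface in the logs above.

```shell
# Sketch only: show_seg_offload and disable_seg_offload are illustrative names.

# Show the current TSO/GSO state of an interface (read-only).
show_seg_offload() {
    ethtool -k "$1" | grep -E '(tcp|generic)-segmentation-offload'
}

# Turn TSO and GSO off until the next reboot or driver reload (needs root).
disable_seg_offload() {
    ethtool -K "$1" tso off gso off
}

# Example, assuming the interface is eno1:
#   show_seg_offload eno1
#   disable_seg_offload eno1
#
# To reapply at boot with ifupdown, one option is a post-up line in the
# interface's stanza in /etc/network/interfaces (the ethtool path may differ):
#   post-up /sbin/ethtool -K eno1 tso off gso off
```

Note that lowercase -k only queries the offload features, while uppercase -K changes them, and the change does not survive a reboot on its own.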

aSpeX · Nov 25, 2019

Tacid said:
Maybe you should try disabling TCP segmentation offload and generic segmentation offload:

Code:

ethtool -K <interface> tso off gso off

There is a known problem with Intel I218/I219 NICs and the e1000e driver: a buffer overrun while the I219 is processing DMA transactions. That problem was fixed in kernel 4.15: https://github.com/torvalds/linux/commit/b10effb92e272051dd1ec0d7be56bf9ca85ab927

But I'm on a 5.X-kernel and the problem's the same.

George Michalopoulos · Nov 25, 2019

This is a problem with Intel cards.
I have several servers at Hetzner, all of them with Intel cards, and they had the same problem.
So I followed their advice:

https://wiki.hetzner.de/index.php/Low_performance_with_Intel_i218/i219_NIC/en

With this command, the problem disappeared:
ethtool -K <interface> tso off gso off

PVE 6.0-15

Tacid · Nov 25, 2019

aSpeX said:
But I'm on a 5.X-kernel and the problem's the same.

I'm on a 5.X kernel too. The UDP buffer overrun was fixed in the e1000e driver in kernel 4.15, but the fix brought segmentation offload limitations, so you should turn segmentation offload off if you want full speed.

Did you try the solution before answering? For me, disabling TCP segmentation offload (TSO) and generic segmentation offload (GSO) in the driver solved the problem.

George Michalopoulos · Nov 29, 2019

this fixed the problem for me also, but degraded the transfer speed..

Apollon77 · Nov 29, 2019

I also ran into this once with my NUC8i5BEH :-( How much did the speed degrade?

George Michalopoulos · Nov 29, 2019

I tried transferring the same file (103 GB, scp to a server in the same Hetzner data center) before and after running this command:

103GB 67.7MB/s 25:58 this is before
103GB 37.7MB/s 46:38 this is after

The transfer rate is almost half...
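
As a quick check of the arithmetic on the two rates above, here is a small sketch (rate_change is a made-up helper name):

```shell
# Relative change between the two scp rates reported above (MB/s).
rate_change() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        printf "throughput: %.1f%% of before, transfer time: %.2fx\n",
               100 * after / before, before / after
    }'
}

rate_change 67.7 37.7
# prints: throughput: 55.7% of before, transfer time: 1.80x
```

That matches the reported times: 46:38 is roughly 1.8 times as long as 25:58.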

Apollon77 · Nov 29, 2019

puuhhh ... but better than a crash, I think

George Michalopoulos · Nov 30, 2019

It's not crashing.. it just loses its Ethernet link..
That leaves an unreachable server, somewhere in Hetzner's data centers..

spirit · Dec 1, 2019

This seems related:

https://bugzilla.kernel.org/show_bug.cgi?id=203175

but it seems to be fixed in kernel 5.3:

https://git.kernel.org/pub/scm/linu...4&id=caff422ea81e144842bc44bab408d85ac449377b

Apollon77 · Dec 1, 2019

The bug seems to be fixed in 5.2.2 ... but PVE 6 is on 5.0.x ... maybe the Proxmox guys could patch it themselves in their kernel version? Maybe open an issue in their bug tracker?

spirit · Dec 2, 2019

Apollon77 said:
The bug seems to be fixed in 5.2.2 ... but PVE 6 is on 5.0.x ... maybe the Proxmox guys could patch it themselves in their kernel version? Maybe open an issue in their bug tracker?

Kernel 5.3 is already available in the no-subscription repo (apt install pve-kernel-5.3).

alatteri · Dec 2, 2019

I'm still getting the e1000 hangs with kernel 5.3

spirit · Dec 2, 2019

alatteri said:
I'm still getting the e1000 hangs with kernel 5.3

I'll verify in the Proxmox git source that this commit is correctly reverted.
Changes to the e1000e driver have been small over the past year, and I don't see any other bug report or commit about this.

spirit · Dec 2, 2019

alatteri said:
I'm still getting the e1000 hangs with kernel 5.3

Which pve-kernel-5.3 version?
It should be OK in pve-kernel-5.3.10. (I don't think it's already patched in 5.3.7.)
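
One way to answer the version question is to compare the running kernel against 5.3.10. This is only a sketch: version_at_least is a made-up helper, it relies on GNU sort's -V option, and it assumes the uname -r string (for example 5.3.10-1-pve) tracks the pve-kernel package version. A newly installed kernel only shows up here after a reboot.

```shell
# Succeeds if version $2 is at least $1, using GNU sort's version ordering.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

running=$(uname -r)
if version_at_least 5.3.10 "$running"; then
    echo "$running is at or above 5.3.10"
else
    echo "$running is older than 5.3.10"
fi
```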
