e1000e driver hang

alatteri

In the past week we have been seeing random "e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang" failures across all our nodes, even on different hardware hosts. We have to reset the host to recover.

There are lots of references to this issue going back 5+ years. Was there a driver change in the latest updates? We've run this hardware for years without issue; now, just this week, it's popping up all over.

Kernel Version: Linux 5.0.21-2-pve #1 SMP PVE 5.0.21-3 (Thu, 05 Sep 2019 13:56:01 +0200)
PVE Manager Version: pve-manager/6.0-7/2898402
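
For reference, this is how we checked which e1000e build is actually in use on a node (a quick sketch; eno1 is our interface name as in the logs, yours may differ):

Code:
# driver name, version and firmware bound to the interface
ethtool -i eno1
# version of the e1000e module shipped with the running kernel
modinfo e1000e | grep -E '^(version|vermagic)'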

Sep 22 20:03:08 vmhost03 kernel: [154458.471981] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] TDH <39>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] TDT <8f>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_use <8f>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_clean <39>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] buffer_info[next_to_clean]:
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] time_stamp <1024c0b53>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_watch <3a>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] jiffies <1024c11f0>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_watch.status <0>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] MAC Status <40080083>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] PHY Status <796d>

Sep 16 14:36:41 vmhost03 kernel: [67010.834277] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
Sep 16 14:36:41 vmhost03 kernel: [67010.834295] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x221/0x230
Sep 16 14:36:41 vmhost03 kernel: [67010.834295] Modules linked in: veth arc4 md4 cmac nls_utf8 cifs ccm fscache ebtable_filter ebtables ip_set ip6table_filter ip6_tables sctp iptabl
e_filter bpfilter softdog nfnetlink_log nfnetlink intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nls_iso8859_1
aesni_intel zfs(PO) aes_x86_64 crypto_simd cryptd glue_helper zunicode(PO) zlua(PO) intel_cstate i915 kvmgt intel_rapl_perf snd_pcm vfio_mdev mdev vfio_iommu_type1 snd_timer vfio s
nd soundcore pcspkr kvm wmi_bmof irqbypass intel_wmi_thunderbolt drm_kms_helper drm intel_xhci_usb_role_switch i2c_algo_bit mei_me roles fb_sys_fops syscopyarea sysfillrect mei sysi
mgblt intel_pch_thermal acpi_pad mac_hid zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi sunr
pc scsi_transport_iscsi ip_tables x_tables autofs4 xfs btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison
Sep 16 14:36:41 vmhost03 kernel: [67010.834327] dm_bufio libcrc32c i2c_i801 ahci e1000e libahci wmi video
Sep 16 14:36:41 vmhost03 kernel: [67010.834331] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P O 5.0.21-1-pve #1
Sep 16 14:36:41 vmhost03 kernel: [67010.834332] Hardware name: Intel Corporation NUC7i3BNK/NUC7i3BNB, BIOS BNKBL357.86A.0080.2019.0725.1139 07/25/2019
Sep 16 14:36:41 vmhost03 kernel: [67010.834334] RIP: 0010:dev_watchdog+0x221/0x230
Sep 16 14:36:41 vmhost03 kernel: [67010.834335] Code: 00 49 63 4e e0 eb 92 4c 89 ef c6 05 0b a2 ef 00 01 e8 f3 2a fc ff 89 d9 4c 89 ee 48 c7 c7 30 0a 1b ab 48 89 c2 e8 b1 d5 78 ff <0f> 0b eb c0 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
Sep 16 14:36:41 vmhost03 kernel: [67010.834336] RSP: 0018:ffff97b2deb03e68 EFLAGS: 00010286
Sep 16 14:36:41 vmhost03 kernel: [67010.834337] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
Sep 16 14:36:41 vmhost03 kernel: [67010.834338] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff97b2deb16440
Sep 16 14:36:41 vmhost03 kernel: [67010.834339] RBP: ffff97b2deb03e98 R08: 0000000000000001 R09: 00000000000003ca
Sep 16 14:36:41 vmhost03 kernel: [67010.834339] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
Sep 16 14:36:41 vmhost03 kernel: [67010.834340] R13: ffff97b2cf570000 R14: ffff97b2cf5704c0 R15: ffff97b2d01f9e80
Sep 16 14:36:41 vmhost03 kernel: [67010.834341] FS: 0000000000000000(0000) GS:ffff97b2deb00000(0000) knlGS:0000000000000000
Sep 16 14:36:41 vmhost03 kernel: [67010.834342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 14:36:41 vmhost03 kernel: [67010.834343] CR2: 000000000121a7e0 CR3: 00000006d3a0e001 CR4: 00000000003626e0
Sep 16 14:36:41 vmhost03 kernel: [67010.834343] Call Trace:
Sep 16 14:36:41 vmhost03 kernel: [67010.834345] <IRQ>
Sep 16 14:36:41 vmhost03 kernel: [67010.834348] ? pfifo_fast_enqueue+0x120/0x120
Sep 16 14:36:41 vmhost03 kernel: [67010.834351] call_timer_fn+0x30/0x130
Sep 16 14:36:41 vmhost03 kernel: [67010.834353] run_timer_softirq+0x3e4/0x420
Sep 16 14:36:41 vmhost03 kernel: [67010.834355] ? ktime_get+0x3c/0xa0
Sep 16 14:36:41 vmhost03 kernel: [67010.834357] ? lapic_next_deadline+0x26/0x30
Sep 16 14:36:41 vmhost03 kernel: [67010.834359] ? clockevents_program_event+0x93/0xf0
Sep 16 14:36:41 vmhost03 kernel: [67010.834362] __do_softirq+0xdc/0x2f3
Sep 16 14:36:41 vmhost03 kernel: [67010.834364] irq_exit+0xc0/0xd0
Sep 16 14:36:41 vmhost03 kernel: [67010.834366] smp_apic_timer_interrupt+0x79/0x140
Sep 16 14:36:41 vmhost03 kernel: [67010.834368] apic_timer_interrupt+0xf/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834368] </IRQ>
Sep 16 14:36:41 vmhost03 kernel: [67010.834370] RIP: 0010:cpuidle_enter_state+0xbd/0x450
Sep 16 14:36:41 vmhost03 kernel: [67010.834371] Code: ff e8 47 27 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a 57 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75 c8 48 8d
Sep 16 14:36:41 vmhost03 kernel: [67010.834372] RSP: 0018:ffffb350431e7e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Sep 16 14:36:41 vmhost03 kernel: [67010.834373] RAX: ffff97b2deb22d80 RBX: ffffffffab553d40 RCX: 000000000000001f
Sep 16 14:36:41 vmhost03 kernel: [67010.834374] RDX: 00003cf22cf8fd23 RSI: 0000000035555555 RDI: 0000000000000000
Sep 16 14:36:41 vmhost03 kernel: [67010.834375] RBP: ffffb350431e7ea0 R08: 0000000000000000 R09: 0000000000022640
Sep 16 14:36:41 vmhost03 kernel: [67010.834375] R10: 0000924bca543948 R11: ffff97b2deb21c04 R12: ffff97b2deb2cd00
Sep 16 14:36:41 vmhost03 kernel: [67010.834376] R13: 0000000000000006 R14: ffffffffab553f98 R15: ffffffffab553f80
Sep 16 14:36:41 vmhost03 kernel: [67010.834378] cpuidle_enter+0x17/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834381] call_cpuidle+0x23/0x40
Sep 16 14:36:41 vmhost03 kernel: [67010.834382] do_idle+0x23a/0x280
Sep 16 14:36:41 vmhost03 kernel: [67010.834384] cpu_startup_entry+0x1d/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834386] start_secondary+0x1ab/0x200
Sep 16 14:36:41 vmhost03 kernel: [67010.834388] secondary_startup_64+0xa4/0xb0
Sep 16 14:36:41 vmhost03 kernel: [67010.834390] ---[ end trace 25fa321422d7a98c ]---
 
I'm seeing this same issue on both my Intel NUC8i5BEH units (Intel® Ethernet Connection I219-V). Both run PVE 6 and produce the same set of "e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang" error messages as above in /var/log/syslog, sometimes in excess of 150 times a day...

I assume this is a driver issue, given that it happens on both hardware units?
 
Hey,
Same problem for me: e1000e, "eno1: Detected Hardware Unit Hang", with kernels 5.0.21-2-pve and 5.3.7-1-pve.
Sometimes the connection doesn't come back and I have to reboot the node.
The problem appeared after I had to reinstall Proxmox VE; everything worked without a problem on the old version.
Any solutions yet?
 
Still getting hangs, and this time also a call trace:

Code:
[50316.713031] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100becd68>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50318.729038] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100becf60>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50320.749008] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100bed159>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50322.760939] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                 TDH                  <71>
                 TDT                  <9a>
                 next_to_use          <9a>
                 next_to_clean        <70>
               buffer_info[next_to_clean]:
                 time_stamp           <100becc36>
                 next_to_watch        <71>
                 jiffies              <100bed350>
                 next_to_watch.status <0>
               MAC Status             <40080083>
               PHY Status             <796d>
               PHY 1000BASE-T Status  <3800>
               PHY Extended Status    <3000>
               PCI Status             <10>
[50322.952657] ------------[ cut here ]------------
[50322.952659] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
[50322.952672] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:466 dev_watchdog+0x221/0x230
[50322.952673] Modules linked in: tcp_diag inet_diag binfmt_misc veth ebtable_filter ebtables ip_set ip6table_filter ip6_tables iptable_filter bpfilter softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_skl_ipc snd_soc_sst_ipc i915 snd_soc_sst_dsp kvmgt vfio_mdev mdev vfio_iommu_type1 snd_soc_acpi_intel_match vfio snd_soc_acpi kvm snd_soc_core aesni_intel irqbypass snd_compress ac97_bus aes_x86_64 snd_pcm_dmaengine crypto_simd wmi_bmof snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep drm_kms_helper cryptd glue_helper intel_cstate intel_rapl_perf drm pcspkr i2c_algo_bit mei_me intel_wmi_thunderbolt fb_sys_fops snd_pcm syscopyarea snd_timer sysfillrect snd sysimgblt soundcore mei intel_pch_thermal acpi_pad mac_hid acpi_tad vhost_net vhost tap ib_iser rdma_cm iw_cm
[50322.952693]  ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) btrfs xor zstd_compress raid6_pq libcrc32c uas usb_storage ahci e1000e i2c_i801 libahci wmi pinctrl_cannonlake video pinctrl_intel
[50322.952703] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P           O      5.0.21-4-pve #1
[50322.952704] Hardware name: Intel(R) Client Systems NUC8i3BEH/NUC8BEB, BIOS BECFL357.86A.0071.2019.0510.1505 05/10/2019
[50322.952705] RIP: 0010:dev_watchdog+0x221/0x230
[50322.952706] Code: 00 49 63 4e e0 eb 92 4c 89 ef c6 05 7d 40 ee 00 01 e8 03 22 fc ff 89 d9 4c 89 ee 48 c7 c7 e0 aa 1d a8 48 89 c2 e8 71 4f 77 ff <0f> 0b eb c0 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[50322.952707] RSP: 0018:ffff9ae8adb03e68 EFLAGS: 00010286
[50322.952707] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[50322.952708] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9ae8adb16440
[50322.952708] RBP: ffff9ae8adb03e98 R08: 0000000000000001 R09: 00000000000003cb
[50322.952709] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
[50322.952709] R13: ffff9ae8a192c000 R14: ffff9ae8a192c4c0 R15: ffff9ae8a1c75280
[50322.952710] FS:  0000000000000000(0000) GS:ffff9ae8adb00000(0000) knlGS:0000000000000000
[50322.952711] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50322.952711] CR2: 0000000001880000 CR3: 000000017560e003 CR4: 00000000003626e0
[50322.952711] Call Trace:
[50322.952713]  <IRQ>
[50322.952715]  ? pfifo_fast_enqueue+0x120/0x120
[50322.952717]  call_timer_fn+0x30/0x130
[50322.952718]  run_timer_softirq+0x3e4/0x420
[50322.952720]  ? ktime_get+0x40/0xa0
[50322.952721]  ? lapic_next_deadline+0x26/0x30
[50322.952723]  ? clockevents_program_event+0x93/0xf0
[50322.952724]  __do_softirq+0xdc/0x2f3
[50322.952726]  irq_exit+0xc0/0xd0
[50322.952727]  smp_apic_timer_interrupt+0x79/0x140
[50322.952728]  apic_timer_interrupt+0xf/0x20
[50322.952728]  </IRQ>
[50322.952730] RIP: 0010:cpuidle_enter_state+0xbd/0x450
[50322.952731] Code: ff e8 17 9d 85 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a d2 8b ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 89 cf 01 00 00 41 c7 44 24 08 00 00 00 00 48 83 c4 18
[50322.952731] RSP: 0018:ffffbddc41973e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[50322.952732] RAX: ffff9ae8adb221c0 RBX: ffffffffa8553e60 RCX: 000000000000001f
[50322.952732] RDX: 00002dc4b9b788ec RSI: 000000002aaaaaaa RDI: 0000000000000000
[50322.952733] RBP: ffffbddc41973ea0 R08: 0000000000000000 R09: 0000000000021a80
[50322.952733] R10: 0000895ded97c40b R11: ffff9ae8adb21044 R12: ffff9ae8adb2cd00
[50322.952734] R13: 0000000000000004 R14: ffffffffa8553ff8 R15: ffffffffa8553fe0
[50322.952735]  cpuidle_enter+0x17/0x20
[50322.952737]  call_cpuidle+0x23/0x40
[50322.952737]  do_idle+0x22c/0x270
[50322.952738]  cpu_startup_entry+0x1d/0x20
[50322.952740]  start_secondary+0x1ab/0x200
[50322.952741]  secondary_startup_64+0xa4/0xb0
[50322.952742] ---[ end trace 68ab007781b80a74 ]---
[50322.952754] e1000e 0000:00:1f.6 eno1: Reset adapter unexpectedly
[50322.952881] vmbr0: port 1(eno1) entered disabled state
[50326.795286] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[50326.795367] vmbr0: port 1(eno1) entered blocking state
[50326.795370] vmbr0: port 1(eno1) entered forwarding state

 
But I'm on a 5.x kernel and the problem is the same.

I'm on a 5.x kernel too. A UDP buffer overrun in the e1000 driver was fixed in the 4.15 kernel, which introduced segmentation offload limitations, so you should turn it off if you want full speed.

Did you try the solution before writing your answer? For me, disabling TCP segmentation offload (TSO) and generic segmentation offload (GSO) in the driver solved the problem.
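
For anyone who wants to try the same thing, a minimal sketch of the runtime change, assuming the NIC is eno1 as in the logs above (ethtool settings do not survive a reboot):

Code:
# disable TCP segmentation offload and generic segmentation offload
ethtool -K eno1 tso off gso off
# verify the result
ethtool -k eno1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'

To make it persistent on a stock Proxmox/Debian setup you can hook it into /etc/network/interfaces, e.g. on the bridge port:

Code:
iface eno1 inet manual
        post-up /sbin/ethtool -K eno1 tso off gso off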
 
I tried to transfer the same file (103 GB, via scp to a server in the same data center, Hetzner) before and after running this command:

103GB 67.7MB/s 25:58 (before)
103GB 37.7MB/s 46:38 (after)

The transfer rate is almost halved...
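
If the speed penalty matters more than the hangs, the change is easy to flip back (again assuming the interface is eno1):

Code:
# re-enable the offloads and confirm the current settings
ethtool -K eno1 tso on gso on
ethtool -k eno1 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'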
 
The bug seems to be fixed in kernel 5.2.2 ... but PVE 6 is on 5.0.x ... maybe the Proxmox team could patch it themselves in their kernel build? Maybe open an issue in their bug tracker?
 
