In the past week we have been seeing random "e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang" failures across all our nodes, even on hosts with different hardware. We have to reset the host to recover.
There are lots of references to this issue going back 5+ years. Was there a driver change in the latest updates? We've run this hardware for years without issue, and now, just this week, it's popping up all over.
Kernel Version Linux 5.0.21-2-pve #1 SMP PVE 5.0.21-3 (Thu, 05 Sep 2019 13:56:01 +0200)
PVE Manager Version pve-manager/6.0-7/2898402
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] TDH <39>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] TDT <8f>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_use <8f>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_clean <39>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] buffer_info[next_to_clean]:
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] time_stamp <1024c0b53>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_watch <3a>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] jiffies <1024c11f0>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] next_to_watch.status <0>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] MAC Status <40080083>
Sep 22 20:03:08 vmhost03 kernel: [154458.471981] PHY Status <796d>
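The hang dump is a snapshot of the transmit ring. TDH is the NIC's head pointer and TDT is the driver's tail pointer, so the descriptors between them have been queued by the driver but not yet processed by the hardware, while time_stamp and jiffies give the age of the oldest unfinished buffer. Below is a minimal Python sketch that decodes the fields above. It assumes the default e1000e TX ring of 256 descriptors (the real size can be checked with `ethtool -g eno1`). The function names are hypothetical and exist only for illustration.

```python
import re

# Matches "... kernel: [uptime] FIELD NAME <hexvalue>" lines from the e1000e
# hang dump; lines without a <hex> value (the header, "buffer_info[...]:") are skipped.
FIELD_RE = re.compile(r"\]\s+(?P<name>[^<\]]+?)\s+<(?P<value>[0-9a-fA-F]+)>\s*$")

# Assumption: default e1000e TX ring size; verify with `ethtool -g <iface>`.
DEFAULT_TX_RING = 256


def parse_hang_dump(lines):
    """Return {field name: int} for every 'NAME <hex>' line in the dump."""
    fields = {}
    for line in lines:
        m = FIELD_RE.search(line)
        if m:
            fields[m.group("name")] = int(m.group("value"), 16)
    return fields


def summarize_hang(fields, ring_size=DEFAULT_TX_RING):
    """Derive how many descriptors are stuck and how long the oldest has waited."""
    tdh, tdt = fields["TDH"], fields["TDT"]
    return {
        # Descriptors queued by the driver (up to TDT) that the NIC has not
        # consumed yet (its head pointer is still at TDH); modulo handles wrap.
        "pending_descriptors": (tdt - tdh) % ring_size,
        # Age of the oldest unfinished buffer in jiffies (divide by CONFIG_HZ for seconds).
        "stall_jiffies": fields["jiffies"] - fields["time_stamp"],
        # True when the driver's ring indices agree with the hardware registers.
        "driver_matches_hw": fields["next_to_use"] == tdt
        and fields["next_to_clean"] == tdh,
    }
```

For the dump above this gives 86 pending descriptors (0x8f - 0x39) and a stall of 1693 jiffies (0x1024c11f0 - 0x1024c0b53). next_to_use and next_to_clean also match TDT and TDH. Taken together, this suggests the driver's bookkeeping is consistent and that the NIC itself stopped advancing its head pointer.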
Sep 16 14:36:41 vmhost03 kernel: [67010.834277] NETDEV WATCHDOG: eno1 (e1000e): transmit queue 0 timed out
Sep 16 14:36:41 vmhost03 kernel: [67010.834295] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x221/0x230
Sep 16 14:36:41 vmhost03 kernel: [67010.834295] Modules linked in: veth arc4 md4 cmac nls_utf8 cifs ccm fscache ebtable_filter ebtables ip_set ip6table_filter ip6_tables sctp iptable_filter bpfilter softdog nfnetlink_log nfnetlink intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nls_iso8859_1 aesni_intel zfs(PO) aes_x86_64 crypto_simd cryptd glue_helper zunicode(PO) zlua(PO) intel_cstate i915 kvmgt intel_rapl_perf snd_pcm vfio_mdev mdev vfio_iommu_type1 snd_timer vfio snd soundcore pcspkr kvm wmi_bmof irqbypass intel_wmi_thunderbolt drm_kms_helper drm intel_xhci_usb_role_switch i2c_algo_bit mei_me roles fb_sys_fops syscopyarea sysfillrect mei sysimgblt intel_pch_thermal acpi_pad mac_hid zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi sunrpc scsi_transport_iscsi ip_tables x_tables autofs4 xfs btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison
Sep 16 14:36:41 vmhost03 kernel: [67010.834327] dm_bufio libcrc32c i2c_i801 ahci e1000e libahci wmi video
Sep 16 14:36:41 vmhost03 kernel: [67010.834331] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P O 5.0.21-1-pve #1
Sep 16 14:36:41 vmhost03 kernel: [67010.834332] Hardware name: Intel Corporation NUC7i3BNK/NUC7i3BNB, BIOS BNKBL357.86A.0080.2019.0725.1139 07/25/2019
Sep 16 14:36:41 vmhost03 kernel: [67010.834334] RIP: 0010:dev_watchdog+0x221/0x230
Sep 16 14:36:41 vmhost03 kernel: [67010.834335] Code: 00 49 63 4e e0 eb 92 4c 89 ef c6 05 0b a2 ef 00 01 e8 f3 2a fc ff 89 d9 4c 89 ee 48 c7 c7 30 0a 1b ab 48 89 c2 e8 b1 d5 78 ff <0f> 0b eb c0 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
Sep 16 14:36:41 vmhost03 kernel: [67010.834336] RSP: 0018:ffff97b2deb03e68 EFLAGS: 00010286
Sep 16 14:36:41 vmhost03 kernel: [67010.834337] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
Sep 16 14:36:41 vmhost03 kernel: [67010.834338] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff97b2deb16440
Sep 16 14:36:41 vmhost03 kernel: [67010.834339] RBP: ffff97b2deb03e98 R08: 0000000000000001 R09: 00000000000003ca
Sep 16 14:36:41 vmhost03 kernel: [67010.834339] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000001
Sep 16 14:36:41 vmhost03 kernel: [67010.834340] R13: ffff97b2cf570000 R14: ffff97b2cf5704c0 R15: ffff97b2d01f9e80
Sep 16 14:36:41 vmhost03 kernel: [67010.834341] FS: 0000000000000000(0000) GS:ffff97b2deb00000(0000) knlGS:0000000000000000
Sep 16 14:36:41 vmhost03 kernel: [67010.834342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 14:36:41 vmhost03 kernel: [67010.834343] CR2: 000000000121a7e0 CR3: 00000006d3a0e001 CR4: 00000000003626e0
Sep 16 14:36:41 vmhost03 kernel: [67010.834343] Call Trace:
Sep 16 14:36:41 vmhost03 kernel: [67010.834345] <IRQ>
Sep 16 14:36:41 vmhost03 kernel: [67010.834348] ? pfifo_fast_enqueue+0x120/0x120
Sep 16 14:36:41 vmhost03 kernel: [67010.834351] call_timer_fn+0x30/0x130
Sep 16 14:36:41 vmhost03 kernel: [67010.834353] run_timer_softirq+0x3e4/0x420
Sep 16 14:36:41 vmhost03 kernel: [67010.834355] ? ktime_get+0x3c/0xa0
Sep 16 14:36:41 vmhost03 kernel: [67010.834357] ? lapic_next_deadline+0x26/0x30
Sep 16 14:36:41 vmhost03 kernel: [67010.834359] ? clockevents_program_event+0x93/0xf0
Sep 16 14:36:41 vmhost03 kernel: [67010.834362] __do_softirq+0xdc/0x2f3
Sep 16 14:36:41 vmhost03 kernel: [67010.834364] irq_exit+0xc0/0xd0
Sep 16 14:36:41 vmhost03 kernel: [67010.834366] smp_apic_timer_interrupt+0x79/0x140
Sep 16 14:36:41 vmhost03 kernel: [67010.834368] apic_timer_interrupt+0xf/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834368] </IRQ>
Sep 16 14:36:41 vmhost03 kernel: [67010.834370] RIP: 0010:cpuidle_enter_state+0xbd/0x450
Sep 16 14:36:41 vmhost03 kernel: [67010.834371] Code: ff e8 47 27 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 63 03 00 00 31 ff e8 2a 57 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8d 02 00 00 49 63 cd 48 8b 75 d0 48 2b 75 c8 48 8d
Sep 16 14:36:41 vmhost03 kernel: [67010.834372] RSP: 0018:ffffb350431e7e60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Sep 16 14:36:41 vmhost03 kernel: [67010.834373] RAX: ffff97b2deb22d80 RBX: ffffffffab553d40 RCX: 000000000000001f
Sep 16 14:36:41 vmhost03 kernel: [67010.834374] RDX: 00003cf22cf8fd23 RSI: 0000000035555555 RDI: 0000000000000000
Sep 16 14:36:41 vmhost03 kernel: [67010.834375] RBP: ffffb350431e7ea0 R08: 0000000000000000 R09: 0000000000022640
Sep 16 14:36:41 vmhost03 kernel: [67010.834375] R10: 0000924bca543948 R11: ffff97b2deb21c04 R12: ffff97b2deb2cd00
Sep 16 14:36:41 vmhost03 kernel: [67010.834376] R13: 0000000000000006 R14: ffffffffab553f98 R15: ffffffffab553f80
Sep 16 14:36:41 vmhost03 kernel: [67010.834378] cpuidle_enter+0x17/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834381] call_cpuidle+0x23/0x40
Sep 16 14:36:41 vmhost03 kernel: [67010.834382] do_idle+0x23a/0x280
Sep 16 14:36:41 vmhost03 kernel: [67010.834384] cpu_startup_entry+0x1d/0x20
Sep 16 14:36:41 vmhost03 kernel: [67010.834386] start_secondary+0x1ab/0x200
Sep 16 14:36:41 vmhost03 kernel: [67010.834388] secondary_startup_64+0xa4/0xb0
Sep 16 14:36:41 vmhost03 kernel: [67010.834390] ---[ end trace 25fa321422d7a98c ]---
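To check whether these events really started with the latest update, you can count them per node and per day in the saved kernel logs. The Python sketch below assumes the classic syslog line format shown in the excerpts above ("Sep 16 14:36:41 vmhost03 kernel: ..."). It matches only the two message fragments quoted in this post. The function name and signature keys are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Classic syslog prefix as seen above: "Sep 16 14:36:41 vmhost03 kernel: ..."
SYSLOG_RE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\S+\s+(?P<host>\S+)\s+kernel:"
)

# Message fragments for the two symptoms reported in this post.
SIGNATURES = {
    "unit_hang": "Detected Hardware Unit Hang",
    "tx_timeout": "NETDEV WATCHDOG",
}


def count_events(lines):
    """Count hang/timeout events per (host, month, day, kind)."""
    counts = Counter()
    for line in lines:
        m = SYSLOG_RE.match(line)
        if not m:
            continue  # not a kernel syslog line
        for kind, needle in SIGNATURES.items():
            if needle in line:
                counts[(m["host"], m["mon"], int(m["day"]), kind)] += 1
    return counts
```

Running it over each node's logs and sorting the keys by date shows whether the first occurrence lines up with the day the kernel package was upgraded.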