Detected Hardware Unit Hang: NIC resetting unexpectedly during high throughput

jsalas424

Active Member
Jul 5, 2020
I have noticed in my syslog that during times of high throughput, I am getting this hardware hanging issue. How do I begin to troubleshoot this?

Code:
Jun 26 21:39:45 TracheNodeA corosync[1828]:   [KNET  ] link: host: 1 link: 1 is down
Jun 26 21:39:45 TracheNodeA corosync[1828]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 26 21:39:45 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c272c0>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <78ff>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:47 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c274b8>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:49 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c276a8>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:50 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
Jun 26 21:39:54 TracheNodeA kernel: e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

cross-posted to reddit: https://www.reddit.com/r/techsupport/comments/o8nu0m/detected_hardware_unit_hang_nic_resetting/
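Before changing anything, it can help to know how often the hang actually recurs. A minimal sketch that counts the events in the kernel log (reads stdin, so it works with journalctl or a syslog file; `journalctl` is assumed to be available, as on a standard PVE install):

```shell
#!/bin/sh
# Count e1000e "Hardware Unit Hang" events in kernel log lines on stdin.
count_hangs() {
    grep -c 'Detected Hardware Unit Hang'
}

# Typical usage on the affected node:
#   journalctl -k --no-pager | count_hangs
#   count_hangs < /var/log/syslog
```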
 
It looks like the best solution so far has been to disable hardware offloading features and sacrifice performance; that's disappointing.

I dug around and found that the NIC is an Intel I217-LM and PVE is running driver e1000e v3.2.6-k:
Code:
root@NodeA:~# ethtool -i eno1 | grep -i driver
driver: e1000e
root@NodeA:~# ethtool -i eno1 | grep -i version
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version:

There's a newer driver (v3.8.4) available from Intel and I wanted to see if that could help. I've never installed an out-of-tree driver on a PVE build and wanted to check whether there are any contraindications.
 
There's a newer driver (v3.8.4) available from Intel and I wanted to see if that could help. I've never installed an out-of-tree driver on a PVE build and wanted to check whether there are any contraindications.
In my experience, installing the out-of-tree drivers from Intel is a bit hit and miss - for some cards they fix all issues, for others they cause issues that do not happen with the in-tree drivers.
I'm not aware of any major problems it should cause - however, it can always happen that a change to internal kernel interfaces causes the out-of-tree DKMS drivers to fail to compile or to stop working.
We usually don't support out-of-tree drivers in our Enterprise support, if that's relevant for you.
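For anyone who does want to try that route, the rough shape of an out-of-tree e1000e build is sketched below. The tarball name/version and package names are assumptions based on Intel's usual release layout, not verified against the current download page, and the headers must match your running kernel:

```shell
# Sketch only - ASSUMES Intel's usual source tarball layout for v3.8.4.
apt update
apt install -y build-essential pve-headers-$(uname -r)

tar xf e1000e-3.8.4.tar.gz    # tarball downloaded from Intel beforehand
cd e1000e-3.8.4/src
make
make install

# Reload the driver (this drops the link briefly) and confirm the version:
modprobe -r e1000e && modprobe e1000e
ethtool -i eno1 | grep ^version
```

Note that a kernel upgrade will require rebuilding the module unless you wire it into DKMS yourself.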

It looks like the best solution so far has been to disable hardware offloading features and sacrifice performance; that's disappointing.
While I never ran explicit benchmarks, I was always under the impression that in most scenarios (and in the average hypervisor deployment) performance did not suffer too much when disabling hardware offloading.

I hope this helps!
 
I also saw this Hardware Unit Hang in the last few days, although I have kernel 5.4.128-1 on my PVE server.
Just to get a clear picture of the status now:

As I understand it, this issue was fixed in current 5.x kernels, and the "downgrade" of the NIC ("ethtool -K <interface> tso off gso off" to disable hw-offloading features), which slows down network performance, should no longer be necessary on current PVE installations.

Is this correct?
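For reference, one way to check which offload features are actually still enabled on your interface is a small helper around `ethtool -k` output (interface name is whatever yours is; the parsing below is an assumption about the usual `feature: on` output format):

```shell
#!/bin/sh
# Filter "ethtool -k <iface>" output down to the features that are "on",
# so you can confirm whether tso/gso etc. are active on your kernel.
list_enabled_offloads() {
    grep ': on' | cut -d: -f1
}

# Typical usage:
#   ethtool -k eno1 | list_enabled_offloads
```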
 
Apparently it is not fixed.
Just got it on a freshly installed instance:
Code:
Jan 12 12:27:07 dev06 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
  TDH                  <a5>
  TDT                  <3>
  next_to_use          <3>
  next_to_clean        <a5>
buffer_info[next_to_clean]:
  time_stamp           <100058dd6>
  next_to_watch        <a6>
  jiffies              <100059468>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>

Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100)

pve-manager/7.1-8/5b267f33
 
Yeah, I ran into this issue in the last month - I put a cheap TP-Link Realtek NIC in my server for $15 rather than trying to actually fix it, since it's a driver issue as far as I can tell.
 
I ended up with:
Code:
auto eth0
iface eth0 inet static
    address XX.XX.XX.XX/XX
    gateway NN.NN.NN.NN
    offload-gso off
    offload-gro off
    offload-tso off
    offload-rx off
    offload-tx off
    offload-rxvlan off
    offload-txvlan off
    offload-sg off
    offload-ufo off
    offload-lro off

This works too:
Code:
ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

But in order to preserve this after a reboot (or a network interface restart), it is better to put it in the interfaces file.
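One caveat (an assumption on my part about which network stack is in use): the offload-* stanzas above are understood by ifupdown2, which is the default on current PVE. On classic ifupdown, the equivalent is a post-up hook in the same interfaces file, for example:

```
auto eth0
iface eth0 inet static
    address XX.XX.XX.XX/XX
    gateway NN.NN.NN.NN
    post-up /sbin/ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
```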
 
Hello. I lost connectivity to my server. I saw the light of the Proxmox port on my switch going on and off.
I turned on my display and kept getting this message, or something similar.
I tried several things; a reboot solved it, though I don't know for how long.
I haven't changed anything on my server recently, no updates or config changes; it was just sitting there doing its thing, serving things... it was in the middle of the night.

What could be the cause?

Proxmox 7.4-19, HP G1 400 Mini, integrated Ethernet controller.

Thank you
 
