Detected Hardware Unit Hang: NIC resetting unexpectedly during high throughput

jsalas424

I have noticed in my syslog that during times of high throughput, I am getting this hardware hanging issue. How do I begin to troubleshoot this?

Code:
Jun 26 21:39:45 TracheNodeA corosync[1828]:   [KNET  ] link: host: 1 link: 1 is down
Jun 26 21:39:45 TracheNodeA corosync[1828]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 26 21:39:45 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c272c0>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <78ff>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:47 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c274b8>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:49 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c276a8>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:50 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
Jun 26 21:39:54 TracheNodeA kernel: e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
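As a starting point for reading the register dump above: TDH and TDT are the transmit descriptor head and tail, and every value in the dump is hexadecimal. Below is a minimal sketch of the arithmetic. The helper names are made up for illustration, and the 256-entry TX ring is an assumption (`ethtool -g eno1` reports the real size).

```shell
# Values in the kernel dump are hexadecimal without a 0x prefix.

# Jiffies elapsed between queuing the stuck buffer (time_stamp)
# and the moment the dump was printed (jiffies).
hang_age_jiffies() {
    local time_stamp=$1 jiffies=$2
    echo $(( 0x$jiffies - 0x$time_stamp ))
}

# Descriptors the driver has queued (tail) that the NIC has not consumed (head).
# ring_size defaults to 256 here, which is an assumption; check `ethtool -g <iface>`.
pending_descriptors() {
    local tdh=$1 tdt=$2 ring_size=${3:-256}
    echo $(( (0x$tdt - 0x$tdh + ring_size) % ring_size ))
}

hang_age_jiffies 105c2712e 105c272c0   # first dump: prints 402
pending_descriptors 49 80              # first dump: prints 55
```

In the three dumps above, TDH stays at 49 and time_stamp does not change, while jiffies grows by roughly 500 for every two seconds of wall-clock time. That is consistent with the NIC no longer consuming transmit descriptors until the adapter is reset.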

cross-posted to reddit: https://www.reddit.com/r/techsupport/comments/o8nu0m/detected_hardware_unit_hang_nic_resetting/
 
It looks like the best solution so far here has been to disable hardware offloading features and sacrifice performance, that's disappointing.

I dug around and found that I'm running an Intel I217-LM card and that PVE is using the e1000e driver, v3.2.6-k:
Code:
root@NodeA:~# ethtool -i eno1 | grep -i driver
driver: e1000e
root@NodeA:~# ethtool -i eno1 | grep -i version
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version:
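
Note that `ethtool -i` reports the driver version and the NIC firmware version as separate fields, so installing a newer driver would normally change `version` and leave `firmware-version` alone. Here is a small sketch that pulls the three fields out in one pass instead of two greps. The function name is hypothetical, and it only parses text piped into it.

```shell
# Summarize driver, driver version and firmware version from `ethtool -i` output.
# Reads the ethtool output on stdin so it can be tried without a NIC present.
nic_driver_summary() {
    awk -F': ' '
        $1 == "driver"           { driver = $2 }
        $1 == "version"          { version = $2 }
        $1 == "firmware-version" { firmware = $2 }
        END { printf "driver=%s version=%s firmware=%s\n", driver, version, firmware }
    '
}

# Usage on a live system (eno1 is the interface name from this thread):
#   ethtool -i eno1 | nic_driver_summary
```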

There's a newer driver (v3.8.4) available from Intel, and I wanted to see if that could help. I've never installed firmware on a PVE build and wanted to check if there are any contraindications.
 
"There's a newer driver (v3.8.4) available from Intel and wanted to see if that could help. I've never installed firmware on a pve build and wanted to check if there are any contraindications."
In my experience, installing the out-of-tree drivers from Intel is a bit hit and miss: for some cards they fix all issues, for others they cause issues that do not occur with the in-tree drivers.
I'm not aware of any major problems it should cause. However, it can always happen that a change to internal kernel interfaces causes the out-of-tree DKMS drivers to fail to compile or to stop working.
We usually don't support out-of-tree drivers in our Enterprise support, if that's relevant for you.

"It looks like the best solution so far here has been to disable hardware offloading features and sacrifice performance, that's disappointing."
While I never ran explicit benchmarks, I was always under the impression that in most scenarios (including the average hypervisor deployment) performance does not suffer too much when hardware offloading is disabled.

I hope this helps!
 
I also saw this Hardware Unit Hang in the last few days, although I have kernel 5.4.128-1 on my PVE server.
Just to get a clear picture of the status now:

As I understand it, this issue was fixed in current 5.x kernels, so the "downgrade" on the NIC ("ethtool -K <interface> tso off gso off", which disables hardware offloading features and slows down network performance) should not be necessary on current PVE installations.

Is this correct?
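
Whether the workaround is still needed is probably easiest to judge per machine. Here is a rough sketch for checking whether the two offloads are currently enabled, before and after toggling them. `show_tso_gso` is a hypothetical helper that just filters `ethtool -k` output, and `eno1` is a placeholder interface name.

```shell
# Keep only the TSO and GSO summary lines from `ethtool -k` output (read on stdin).
show_tso_gso() {
    grep -E '^(tcp-segmentation-offload|generic-segmentation-offload):'
}

# Usage on a live system, as root:
#   ethtool -k eno1 | show_tso_gso      # state before
#   ethtool -K eno1 tso off gso off     # the workaround from this thread
#   ethtool -k eno1 | show_tso_gso      # both lines should now end in "off"
```

Settings changed this way last only until the next reboot or interface restart, so the hang can be watched for under load before anything is made permanent.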
 
Apparently it is not fixed.
I just got it on a freshly installed instance:
Jan 12 12:27:07 dev06 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
  TDH                  <a5>
  TDT                  <3>
  next_to_use          <3>
  next_to_clean        <a5>
buffer_info[next_to_clean]:
  time_stamp           <100058dd6>
  next_to_watch        <a6>
  jiffies              <100059468>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>

Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100)

pve-manager/7.1-8/5b267f33
 
Yeah, I ran into this issue in the last month. I put a cheap TP-Link Realtek NIC in my server for $15 rather than trying to actually fix it, since it's a driver issue as far as I can tell.
 
I ended up with:
Code:
auto eth0
iface eth0 inet static
    address XX.XX.XX.XX/XX
    gateway NN.NN.NN.NN
    offload-gso off
    offload-gro off
    offload-tso off
    offload-rx off
    offload-tx off
    offload-rxvlan off
    offload-txvlan off
    offload-sg off
    offload-ufo off
    offload-lro off

This works too:
Code:
ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

But to preserve this across a reboot (or a network interface restart), it is better to put it in the interfaces file.
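
If the `offload-*` options are not picked up on a given setup, another approach that is sometimes used is a `post-up` line in the same stanza that calls ethtool directly. Here is a small sketch that prints such a line. `post_up_line` is a hypothetical helper, and the feature list is only an example.

```shell
# Print an interfaces-file "post-up" line that re-applies ethtool offload
# settings whenever the interface comes up.
# Arguments: interface name, then feature/state pairs as passed to `ethtool -K`.
post_up_line() {
    local iface=$1
    shift
    printf '    post-up ethtool -K %s %s\n' "$iface" "$*"
}

post_up_line eth0 tso off gso off
# prints:     post-up ethtool -K eth0 tso off gso off
```

The printed line would go inside the `iface eth0 ...` stanza. Whether it is needed at all depends on how the installed ifupdown variant handles the `offload-*` keywords.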
 
Hello. I lost connectivity to my server. I saw the light of the Proxmox server's port on my switch going on and off.
I turned on my display and kept seeing this message or something similar.
I tried several things, and a reboot solved it, though I don't know for how long.
I haven't changed anything on my server recently, neither updates nor configs. It was just sitting there doing its thing, serving things... It happened in the middle of the night.

What could be the cause?

Proxmox 7.4-19, HP G1 400 Mini, integrated Ethernet controller.

Thank you