Detected Hardware Unit Hang: NIC resetting unexpectedly during high throughput

jsalas424

Member
Jul 5, 2020
I have noticed in my syslog that during times of high throughput I get this hardware hang. How do I begin to troubleshoot it?

Code:
Jun 26 21:39:45 TracheNodeA corosync[1828]:   [KNET  ] link: host: 1 link: 1 is down
Jun 26 21:39:45 TracheNodeA corosync[1828]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 26 21:39:45 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c272c0>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <78ff>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:47 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c274b8>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:49 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <49>
  TDT                  <80>
  next_to_use          <80>
  next_to_clean        <48>
buffer_info[next_to_clean]:
  time_stamp           <105c2712e>
  next_to_watch        <49>
  jiffies              <105c276a8>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 26 21:39:50 TracheNodeA kernel: e1000e 0000:00:19.0 eno1: Reset adapter unexpectedly
Jun 26 21:39:54 TracheNodeA kernel: e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
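For anyone who wants to reproduce the filtering, something like this should pull just the hang/reset events out of the kernel log (the patterns are taken from the output above):
Code:
# show only the e1000e hang/reset events from the kernel log
journalctl -k | grep -iE "Detected Hardware Unit Hang|Reset adapter unexpectedly"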

cross-posted to reddit: https://www.reddit.com/r/techsupport/comments/o8nu0m/detected_hardware_unit_hang_nic_resetting/
 
It looks like the best solution so far has been to disable the hardware offloading features and sacrifice performance, which is disappointing.

I dug around and found that I'm running an Intel I217-LM card and PVE is using the e1000e driver, v3.2.6-k:
Code:
root@NodeA:~# ethtool -i eno1 | grep -i driver
driver: e1000e
root@NodeA:~# ethtool -i eno1 | grep -i version
version: 3.2.6-k
firmware-version: 0.13-4
expansion-rom-version:

There's a newer driver (v3.8.4) available from Intel and I wanted to see if that could help. I've never installed a driver manually on a PVE build and wanted to check whether there are any contraindications.
 
There's a newer driver (v3.8.4) available from Intel and I wanted to see if that could help. I've never installed a driver manually on a PVE build and wanted to check whether there are any contraindications.
In my experience, installing the out-of-tree drivers from Intel is a bit hit and miss: for some cards they fix all issues, for others they cause issues that don't happen with the in-tree drivers.
I'm not aware of any major problems it should cause; however, it can always happen that a change to internal kernel interfaces causes the out-of-tree DKMS driver to fail to compile or stop working.
We usually don't cover out-of-tree drivers under our enterprise support, if that's relevant for you.
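If you do decide to try it, the build is usually straightforward; here's a rough sketch (assuming the e1000e-3.8.4 tarball has already been downloaded from Intel; package names may differ on your system):
Code:
# build tools and headers for the running PVE kernel
apt install build-essential pve-headers-$(uname -r)

# unpack the Intel source and build/install the module
tar xzf e1000e-3.8.4.tar.gz
cd e1000e-3.8.4/src
make install

# reload the driver (this briefly takes the NIC down)
rmmod e1000e && modprobe e1000e
Note that a plain make install is not DKMS, so the module has to be rebuilt after every kernel upgrade.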

It looks like the best solution so far has been to disable the hardware offloading features and sacrifice performance, which is disappointing.
While I never ran explicit benchmarks, I was always under the impression that in most scenarios (and in the average hypervisor deployment) performance did not suffer much when hardware offloading was disabled.
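If you want numbers for your own setup, a quick before/after run with iperf3 against another host will show the difference; a sketch (eno1 and the server address are placeholders):
Code:
# baseline with offloads enabled
iperf3 -c <iperf3-server> -t 30

# disable the offloads in question and repeat
ethtool -K eno1 tso off gso off gro off
iperf3 -c <iperf3-server> -t 30

# revert if desired
ethtool -K eno1 tso on gso on gro on
At 1 Gbit the offloads mostly save CPU rather than raw throughput, so it is worth watching CPU load during the runs as well.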

I hope this helps!
 
I also saw this Hardware Unit Hang in the last few days, although I have kernel 5.4.128-1 on my PVE server.
Just to get a clear picture of the status now:

As I understand it, this issue has been fixed in current 5.x kernels, so the NIC "downgrade" (ethtool -K <interface> tso off gso off to disable the hw-offloading features), which slows down network performance, should no longer be necessary on current PVE installations.

Is this correct?
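For reference, whether the offloads are currently enabled can be checked with something like this (interface name assumed):
Code:
# show the relevant offload features on the interface
ethtool -k eno1 | grep -E "tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload"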
 
Apparently it is not fixed.
I just got it on a freshly installed instance:
Code:
Jan 12 12:27:07 dev06 kernel: e1000e 0000:00:1f.6 eth0: Detected Hardware Unit Hang:
  TDH                  <a5>
  TDT                  <3>
  next_to_use          <3>
  next_to_clean        <a5>
buffer_info[next_to_clean]:
  time_stamp           <100058dd6>
  next_to_watch        <a6>
  jiffies              <100059468>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>

Linux 5.13.19-2-pve #1 SMP PVE 5.13.19-4 (Mon, 29 Nov 2021 12:10:09 +0100)

pve-manager/7.1-8/5b267f33
 
Yeah, I ran into this issue in the last month and put a cheap TP-Link Realtek NIC in my server for $15 rather than trying to actually fix it, since it's a driver issue as far as I can tell.
 
I ended up with:
Code:
auto eth0
iface eth0 inet static
    address XX.XX.XX.XX/XX
    gateway NN.NN.NN.NN
    offload-gso off
    offload-gro off
    offload-tso off
    offload-rx off
    offload-tx off
    offload-rxvlan off
    offload-txvlan off
    offload-sg off
    offload-ufo off
    offload-lro off

This works too:
Code:
ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

But in order to preserve this across a reboot (or a network interface restart) it is better to put it in the interfaces file.
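If the offload-* options are not recognized on a particular setup, a post-up line that runs the same ethtool command works as well; a sketch using the same placeholder addresses:
Code:
auto eth0
iface eth0 inet static
    address XX.XX.XX.XX/XX
    gateway NN.NN.NN.NN
    # re-apply the offload settings every time the interface comes up
    post-up ethtool -K eth0 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off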
 
