Hardware Unit Hang -> NIC

shorty707

Hey,
I upgraded all packages today (no-subscription repos), and now the NIC seems to crash a few minutes after every reboot.
Any hints what I could do?

Thanks

Code:
Apr 02 16:27:37 proxmox-node1 kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
  TDH                  <ac>
  TDT                  <c3>
  next_to_use          <c3>
  next_to_clean        <ab>
buffer_info[next_to_clean]:
  time_stamp           <100212189>
  next_to_watch        <ac>
  jiffies              <100aaf481>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7c00>
PHY Extended Status    <3000>
PCI Status             <10>


Code:
CPU(s) 8 x 12th Gen Intel(R) Core(TM) i3-12100 (1 Socket)
Kernel Version Linux 6.8.12-9-pve (2025-03-16T19:18Z)
Boot Mode EFI
Manager Version pve-manager/8.3.5/dac3aa88bac3f300
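
To see how often the hang actually occurs, the kernel log can be searched for the message shown in the first code block. A minimal sketch, assuming the log text is piped in on stdin; `count_unit_hangs` is a made-up helper name, not part of any tool:

```shell
#!/bin/sh
# Count "Detected Hardware Unit Hang" lines in kernel log text read from stdin.
# count_unit_hangs is a hypothetical helper name.
count_unit_hangs() {
    # grep -c prints the number of matching lines. It exits non-zero when the
    # count is 0, so "|| true" keeps the function's exit status at 0.
    grep -c 'Detected Hardware Unit Hang' || true
}

# Usage on the affected host (kernel messages of the current boot only):
#   journalctl -k -b | count_unit_hangs
```

Comparing the count across boots with different kernels gives a rough idea of whether a change helped.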
 
Try updating again:
apt-get update && apt-get upgrade

Reinstall or update the firmware package:
apt reinstall pve-firmware

Check that your package sources match your Debian and Proxmox versions. For the no-subscription repository on bookworm:

rm /etc/apt/sources.list.d/pve-enterprise.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list
 
Thanks.
Let's wait and see whether the reinstall helped.


Code:
root@proxmox-node1:~# apt-get update && apt-get upgrade
Hit:1 http://security.debian.org/debian-security bookworm-security InRelease
Hit:2 http://deb.debian.org/debian bookworm InRelease
Hit:3 http://download.proxmox.com/debian/pve bookworm InRelease
Get:4 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
Fetched 55.4 kB in 0s (151 kB/s)   
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages were automatically installed and are no longer required:
  proxmox-kernel-6.8.12-2-pve-signed proxmox-kernel-6.8.12-4-pve-signed proxmox-kernel-6.8.12-5-pve-signed
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
root@proxmox-node1:~# apt reinstall pve-firmware
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages were automatically installed and are no longer required:
  proxmox-kernel-6.8.12-2-pve-signed proxmox-kernel-6.8.12-4-pve-signed proxmox-kernel-6.8.12-5-pve-signed
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 0 not upgraded.
Need to get 0 B/159 MB of archives.
After this operation, 0 B of additional disk space will be used.
(Reading database ... 84989 files and directories currently installed.)
Preparing to unpack .../pve-firmware_3.15-2_all.deb ...
Unpacking pve-firmware (3.15-2) over (3.15-2) ...
Setting up pve-firmware (3.15-2) ...
root@proxmox-node1:~#
 
For the next boot only, I pinned the kernel that, according to the link above, does not have this issue yet.
Let's see ...

Code:
root@proxmox-node1:~# proxmox-boot-tool kernel pin 6.8.12-8-pve --next-boot
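
For reference, here is a rough sketch of the pinning variants as small wrappers around `proxmox-boot-tool`. The function names and the `PBT` variable are hypothetical and only exist so the commands can be dry-run with `echo`; the subcommands themselves (`kernel pin`, `kernel unpin`, `--next-boot`) belong to the real tool.

```shell
#!/bin/sh
# PBT is a hypothetical override so the helpers can be dry-run (PBT=echo).
# By default the real proxmox-boot-tool is called.
PBT="${PBT:-proxmox-boot-tool}"

# Pin a kernel for the next boot only (the variant used above).
pin_next_boot() {
    "$PBT" kernel pin "$1" --next-boot
}

# Pin a kernel until it is explicitly unpinned.
pin_permanently() {
    "$PBT" kernel pin "$1"
}

# Remove the pin so the default kernel selection applies again.
unpin() {
    "$PBT" kernel unpin
}

# Usage:
#   proxmox-boot-tool kernel list      # show installed kernels first
#   pin_next_boot 6.8.12-8-pve
```

Pinning for one boot is a low-risk way to test a kernel; a permanent pin stays in place until `unpin` is run, so newer kernels are installed but not booted.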
 
Kernel 6.8.12-8-pve is stable, with no hardware hang,
as described in the Bugzilla ticket above.
 
I also have this issue. I have four mini PCs in a cluster, and after the upgrade each of them restarts with the hang at least once a day. I have pinned 6.8.12-8-pve and this appears to have resolved the issue.

I want to update to the recent 8.4.0 release, but want to keep things stable for a bit. I would also rather not pin kernels, because I want to keep getting security updates.

Any ideas if this is something the pve team is looking at and if there will be an update released for it?
 
Hey, I think I've figured this one out! Maybe.

I have a (pretty old) Intel gigabit ethernet adapter in my Proxmox server. According to lshw it's a 82571.

I recently did a (long overdue) upgrade from Proxmox 7.4 (Debian bullseye) to 8.4 (Debian bookworm) and almost immediately started getting "Hardware Unit Hang" driver errors like shorty707 is having. About once a day it would do this (and go offline, of course) but return to normal after a reboot, or sometimes even by itself after a few hours.

A grizzled admin I know told me that this is a known issue with these Intel chipsets and it's the TSO (TCP Segmentation Offloading, apparently) feature that causes this. He's had to manually disable TSO on servers in the past. Something about Intel trying to cheat on performance benchmarks.

So here's the magic command to disable TSO (replace the adapter ID as necessary, of course):
ethtool -K enp2s0f0 tso off

This change doesn't persist between reboots. I'm sure there's a proper way to do this, but I did it the lazy way by adding to my crontab:
@reboot /usr/sbin/ethtool -K enp2s0f0 tso off

It's been several days since I did this and it hasn't died on me yet, so it seems this was the fix.
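
To confirm that the setting actually took effect (for example after the `@reboot` cron entry has run), the feature list can be queried with lowercase `ethtool -k`. A small sketch that extracts just the TSO line; `tso_state` is a hypothetical helper name, and `enp2s0f0` is the adapter ID from this post:

```shell
#!/bin/sh
# Print the state of TCP segmentation offload from "ethtool -k" output on stdin.
# tso_state is a hypothetical helper name.
tso_state() {
    # Split each line on ": " and print the value of the top-level TSO entry.
    awk -F': ' '$1 == "tcp-segmentation-offload" { print $2 }'
}

# Usage (lowercase -k queries features, uppercase -K changes them):
#   ethtool -k enp2s0f0 | tso_state
```

If this prints `on` after a reboot, the persistence mechanism did not run or ran before the interface was ready.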
 
How is it working for you, @neckro? I ran into this yesterday (the day I had to leave for the weekend :( ) and after a while I found
Code:
https://first2host.co.uk/blog/how-to-fix-proxmox-detected-hardware-unit-hang/
which recommends adding a post-up line like this to the interface configuration:

Code:
auto eno1
iface eno1 inet static
  address 5.5.5.555
  netmask 255.255.255.224
  gateway 5.5.5.55
  post-up /sbin/ethtool -K eno1 tso off gso off

I did that yesterday, and today I cannot ping my Proxmox box... Maybe a cron job works better. Or maybe I need to use a USB NIC :confused:

## EDIT BEFORE THE MOD APPROVAL

So, I got my brother to go to my home and do a hard restart. I checked the status after the restart, and it looks like the configuration was applied as intended. This is the status of my card:

Bash:
Features for eno1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

So in my case, turning off TSO and GSO did not work.

I will try pinning the kernel, but I would like to keep things up to date :(
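
For anyone comparing their own card against the output above: a short sketch that lists which features are still on and not marked `[fixed]`, i.e. the ones `ethtool -K` could still change. This only summarizes the output; it is not a claim that disabling any of them fixes the hang. `changeable_on_features` is a hypothetical helper name.

```shell
#!/bin/sh
# List features from "ethtool -k" output (stdin) that are "on" and not [fixed].
# changeable_on_features is a hypothetical helper name.
changeable_on_features() {
    # Lines ending in "on [fixed]" do not match, because the pattern requires
    # the line to end right after "on" (trailing whitespace allowed).
    awk '/: on[ \t]*$/ {
        sub(/^[ \t]+/, "")   # drop the indentation of sub-features
        sub(/:.*$/, "")      # keep only the feature name
        print
    }'
}

# Usage:
#   ethtool -k eno1 | changeable_on_features
```

For the output posted above, this would include entries such as rx-checksumming, scatter-gather and generic-receive-offload, while highdma is skipped because it is fixed.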
 