e1000 driver hang

Adding another datapoint.

Dell T5810 with a

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 05)

Been fine for more than a year on Linux version 6.8.4-2-pve, but I upgraded mid-March and ~9 days later had the first e1000e "Detected Hardware Unit Hang" error. That was on Linux version 6.8.12-20-pve.

I didn't investigate it at the time and just assumed a random crash, but another ~9 days later the same thing happened again.

I've set up a cron entry. It's pretty ugly: it just tries to ping my router and, if that fails twice in a row at a 15-minute interval, forces a reboot:

Code:
*/15 * * * * root ping -c3 -W2 192.168.1.1 > /dev/null 2>&1 && rm -f /run/ping-watchdog-fail || { [ -f /run/ping-watchdog-fail ] && /usr/sbin/shutdown -r now || touch /run/ping-watchdog-fail; }
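For anyone who prefers something more readable, here is a rough sketch of the same logic as a small script. It is not what I actually run. TARGET and STATE_FILE mirror the values in the cron entry; CHECK_CMD, REBOOT_CMD and the `--run` argument are hypothetical knobs added only so the logic can be dry-run without rebooting anything.

```shell
#!/bin/sh
# Sketch of the ping watchdog as a script instead of a one-liner.
# TARGET and STATE_FILE match the cron entry; CHECK_CMD, REBOOT_CMD
# and --run are hypothetical additions for dry-running the logic.

TARGET="${TARGET:-192.168.1.1}"
STATE_FILE="${STATE_FILE:-/run/ping-watchdog-fail}"
CHECK_CMD="${CHECK_CMD:-ping -c3 -W2 $TARGET}"
REBOOT_CMD="${REBOOT_CMD:-/usr/sbin/shutdown -r now}"

watchdog_step() {
    # Word splitting of $CHECK_CMD is intentional (command plus arguments).
    if $CHECK_CMD >/dev/null 2>&1; then
        # Router reachable: clear any earlier failure marker.
        rm -f "$STATE_FILE"
        return 0
    fi
    if [ -f "$STATE_FILE" ]; then
        # Second consecutive failure: force a reboot.
        $REBOOT_CMD
    else
        # First failure: remember it and wait for the next run.
        touch "$STATE_FILE"
    fi
}

# Only act when explicitly asked to, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    watchdog_step
fi
```

Saved as, say, /usr/local/sbin/ping-watchdog (a made-up path), the cron line would shrink to `*/15 * * * * root /usr/local/sbin/ping-watchdog --run`. Since /run is a tmpfs, the failure marker disappears on reboot just like with the one-liner.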

This, combined with proxmox-boot-tool kernel pin 6.8.4-2-pve, means I should only have one more failure with the 6.8.12-* branch.

At some point I'll have to revisit newer proxmox kernel releases to see if the regression's been properly fixed.
 
Post update:
Upgraded to Proxmox 9 and the NICs still crash, this time with a hardware failure in the log. Adding a post-up line to the interface in /etc/network/interfaces, as already suggested in this thread, seems to work for me:
Code:
iface nic0 inet manual
        post-up /sbin/ethtool -K nic0 gso off gro off tso off

Original post:
I have two Intel NICs on the following Z390M-ITX/ac mainboard:
- Gigabit LAN 10/100/1000 Mb/s
- 1 x Giga PHY Intel® I219V, 1 x GigaLAN Intel® I211AT
Tested and confirmed that both crash under load. The crashes started after upgrading to the latest Proxmox 8 kernel. Before this kernel I sometimes had remote connection problems, which in hindsight I assume were already related to this issue. Would upgrading to Proxmox 9 help?
The above config kept the NIC crashes away for a couple of months, but they came back with no trace in the log. After adding 'tx off' and 'rx off' to the config as well, things seem to be stable again. For easy reference:
Code:
iface nic0 inet manual
        post-up /sbin/ethtool -K nic0 gso off gro off tso off tx off rx off
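To check that the post-up line actually took effect, something like the following could help. It is only a sketch: `offloads_still_on` is a made-up helper name, and it assumes the usual `ethtool -k` output format of `feature-name: on|off`. It reads that output on stdin and prints any of the five features from the config above that are still enabled.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and
# prints the offload features from the config above that are still "on".
# gso/gro/tso/tx/rx correspond to the long names matched below.
offloads_still_on() {
    awk -F': *' '
        $1 ~ /^(tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|rx-checksumming|tx-checksumming)$/ \
            && $2 ~ /^on/ { print $1 }
    '
}
```

Usage would be `ethtool -k nic0 | offloads_still_on`; empty output means all five are off. The same `ethtool -K nic0 ...` command from the post-up line can also be run by hand as root to apply the change without restarting networking.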
 
So what's the current status on this? I had a bit of drama this morning because of another crash due to "kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang". One of my nodes experiences this roughly every week.
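For tracking how often this happens, a small filter over the kernel log may be enough. This is a sketch under assumptions: `count_unit_hangs` is a made-up helper name, and it simply counts lines containing the message text quoted above.

```shell
#!/bin/sh
# Hypothetical helper: counts kernel log lines on stdin that contain
# the e1000e hang message. grep -c exits non-zero when the count is 0,
# so "|| true" keeps the function's exit status clean.
count_unit_hangs() {
    grep -c 'Detected Hardware Unit Hang' || true
}
```

For example, `journalctl -k -b -1 | count_unit_hangs` would count hangs logged during the previous boot, assuming the journal is persistent; without persistent journaling only the current boot (`journalctl -k`) is available.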

I'm still on 8.4. Are there any fixes or workarounds for the network driver crash? It doesn't seem to be fixed in Proxmox 9 either. I was looking for a good reason to finally migrate to 9, but I'm afraid this isn't it.
 
@kokoon what is your exact running kernel version? I ask because for the past few days I have seen similar issues here with unstable network and SAS, after updating from `6.8.12-22-pve` to `6.8.12-28-pve` on several nodes. We're still in the middle of the investigation, so it is only a guess so far that the issues are related to a recent kernel update.
 
Since this problem has existed for many years now, I no longer expect a fix for the Hardware Unit Hang in the e1000 driver.
I configured all nodes with this problem to disable the offloading features in the network interfaces file.
 
@kokoon what is your exact running kernel version? I ask because for the past few days I have seen similar issues here with unstable network and SAS, after updating from `6.8.12-22-pve` to `6.8.12-28-pve` on several nodes. We're still in the middle of the investigation, so it is only a guess so far that the issues are related to a recent kernel update.
6.8.12-29-pve
 
Since this problem has existed for many years now, I no longer expect a fix for the Hardware Unit Hang in the e1000 driver.
I configured all nodes with this problem to disable the offloading features in the network interfaces file.
So disabling offloading is a definite workaround for it, I gather. That's good enough for me, just a homelab in question here.