e1000 driver hang

Here to say I'm joining the club, unfortunately.

This is literally my first Proxmox install (pve-manager/8.4.0/ec58e45e1bcdf2ac, running kernel 6.8.12-9-pve), and I currently only have remote access to my device (HP EliteDesk 800 G4 mini). So I'm not going to try anything wild like upgrading, at the risk of being locked out again and having to bug someone to perform a hard reboot on site.

For those interested, my journey to this thread: https://chatgpt.com/share/68441cc7-9328-800f-9d2e-62c23765e509.
ChatGPT suggested a small script, run via systemd, that monitors the logs for this error and then simply performs a full reboot of the device.
It'll have to do for now, and hopefully it works...

Maybe it can help someone else in a similar situation.
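The script itself isn't posted here, but the idea is simple enough to sketch. The following is a hypothetical version of such a watchdog (function names and the install approach are my assumptions; the match string is the exact error this thread is about):

```shell
#!/bin/sh
# Hypothetical sketch of a systemd-run watchdog for the e1000e hang.

hang_detected() {
    # Succeeds if the given log line contains the e1000e hang signature.
    case "$1" in
        *"Detected Hardware Unit Hang"*) return 0 ;;
        *) return 1 ;;
    esac
}

watch_and_reboot() {
    # Follow the kernel log and reboot as soon as the hang appears.
    journalctl -kf --no-pager | while read -r line; do
        if hang_detected "$line"; then
            logger "e1000e hang detected, rebooting"
            systemctl reboot
        fi
    done
}

# watch_and_reboot   # uncomment when running from a systemd service
```

To run it at boot, point a simple systemd service's ExecStart at the script with Restart=always and enable it.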
 
I have had network problems since I recently switched from a very old Proxmox version to v8.4. With unchanged hardware, serious malfunctions occur after hours or 1-2 days. An old Intel NIC hangs and apparently also interferes with traffic that does not involve it at all: for example, a ping from VM to VM, which does not require the hanging NIC, no longer works.
I am trying to help myself with a workaround (ethtool -K ens1f1 gso off gro off [...]), but this cannot be a permanent solution. Did you resolve the issue by purchasing a new NIC? Which ones are recommended?

THX
 
Please check the syslog. If there are no "Detected Hardware Unit Hang" error messages in it, then you have a different problem and should open your own thread; this one is only about the "e1000 driver hang" problem. The problem seems to affect kernel versions newer than 6.8.12-8-pve. With that kernel everything works fine. I hope this gets fixed in future kernel versions.
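For the check itself, something like this works (a sketch; count_hangs is a hypothetical helper, and the grep pattern is the exact message this thread is about):

```shell
#!/bin/sh
# Count occurrences of the e1000e hang signature in a saved kernel log.
count_hangs() {
    grep -c "Detected Hardware Unit Hang" "$1"
}

# On a live system:
# journalctl -k > /tmp/klog && count_hangs /tmp/klog
```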
 
Downgraded from 6.8.12-11-pve to 6.8.12-8-pve. The error still happens every once in a while, but it recovers on its own, unlike on 6.8.12-11-pve, where it stays stuck on the Unit Hang error until the server is restarted.
 
I have been battling this problem for some time. I too was getting the network hang issue with kernel versions newer than 6.8.12-8-pve.

I also found that a restart was not necessary to get things going again. I was able to get the network working again by unplugging and replugging the network cable, or by disabling and re-enabling the network port remotely on my managed switch.
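If you still have another way in (a second NIC, IPMI, or a console), the same effect as replugging the cable can be had from software. A sketch, where bounce_link and its "dry" flag are my assumptions and eno1 is just an example interface name:

```shell
#!/bin/sh
# Bounce the hung port the way a switch port disable/enable would.
bounce_link() {
    # $1 = interface, $2 = "dry" to only print what would be run
    if [ "$2" = "dry" ]; then
        printf 'ip link set %s down\nip link set %s up\n' "$1" "$1"
    else
        ip link set "$1" down
        sleep 2
        ip link set "$1" up
    fi
}

bounce_link eno1 dry    # drop "dry" (and run as root) to actually apply it
```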

While I didn't notice any network problems while on kernel 6.8.12-8-pve or earlier, I did notice frequent messages on the console like the following:
[102464.520216] e1000e 0000:00:19.0 eno1: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 7168 ms

I tried disabling the offloading features as recommended in other posts (while on kernel 6.8.12-8-pve) and no longer got these errors. After 3 days of no errors I made the changes persistent and upgraded to 6.8.12-11-pve. It has been 3 days now and I have experienced no network problems!

So it seems there were network problems showing up with earlier kernels as well; only the way they are handled has changed.

I simply added "pre-up ethtool -K eno1 gso off gro off tso off" after "iface eno1 inet manual" in /etc/network/interfaces to make it always run on startup.
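Spelled out, the resulting stanza would look something like this (a sketch only; the interface name and the exact set of offloads to disable vary per system):

```
auto eno1
iface eno1 inet manual
        pre-up /usr/sbin/ethtool -K eno1 gso off gro off tso off
```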
 
This is not a valid solution for me. Sure, the network does not hang with the offloads turned off, but the link drops to FE (100 Mbit).
Anyone else seen this?
 
For anyone like me who finds this thread and can't get the community script, it's been moved to a URL which is almost but not quite exactly the same as the one posted several times. I'll try to paste it here, but in case this forum restricts links from new users you can take the link that's been posted several times and remove the capital D.

https://community-scripts.github.io/ProxmoxVE/scripts?id=nic-offloading-fix
 

Just installed it, let's see if it fixes my issue.

Thank you!
 
Experienced this issue for the first time on a Dell Optiplex 7070 SFF with an I219-LM NIC, on both kernels 6.8.12-11 and 6.8.12-12. Never had the issue until these kernels were pushed out to the repository. Installed the systemd service workaround; will report back if it happens again.
 
Just an update about my nodes with "Hardware Unit Hangs":

Rolling back to kernel 6.8.12-8-pve did not solve my problems. After six weeks without any hardware unit hang, they started again.
At the moment I have three Proxmox nodes with this network problem.
Two weeks ago I updated one node (after 12,960 hangs in one day) from 6.8.12-8-pve to 6.8.12-11-pve and switched off the offloading for my NIC enp0s31f6 in the interfaces file with

Code:
iface enp0s31f6 inet manual
        post-up /usr/sbin/ethtool -K enp0s31f6 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
No Hardware Unit Hangs since then, for two weeks now. I will now change the other two problematic Proxmox nodes the same way.
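To confirm the offloads really stayed off after a reboot, the feature list can be checked. A sketch, where check_offloads is a hypothetical helper that flags features still reported as "on" (it covers only a subset of the features disabled above):

```shell
#!/bin/sh
# Print any of the listed offload features that are still enabled.
check_offloads() {
    # $1 = file containing "ethtool -k <iface>" output
    grep -E '^(generic-segmentation-offload|generic-receive-offload|tcp-segmentation-offload): on' "$1"
}

# On a live system:
# ethtool -k enp0s31f6 > /tmp/features && check_offloads /tmp/features
```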

Further Hardware information:
all three problematic nodes are Fujitsu workstations/PCs (not server hardware)
the two workstations "CELSIUS J5010" have the following NIC
Code:
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (11) I219-LM
        Subsystem: Fujitsu Technology Solutions Ethernet Connection (11) I219-LM
the one PC has the following NIC
Code:
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-V (rev 04)
        Subsystem: Fujitsu Technology Solutions Ethernet Connection I217-V
 
I don't want to jinx it, but this seems to have made a difference: I had been crashing every 1-3 hours previously, and I have now been up for 12 hours straight (and have even been able to increase the cores given to the VM).
If this is the impact of disabling offloading it looks pretty bad. It chewed up half the CPU capacity!

As others mentioned before, disabling offloading is not a solution, just a very inefficient workaround.