e1000 driver hang

Here to say I'm joining the club, unfortunately.

This is literally my first Proxmox install (pve-manager/8.4.0/ec58e45e1bcdf2ac, running kernel: 6.8.12-9-pve), and I currently only have remote access to my device (HP EliteDesk 800 G4 mini). So I'm not going to try anything wild like upgrading, etc., at the risk of being locked out again and having to bug someone to go perform a hard reboot on site.

For those interested, my journey to this thread: https://chatgpt.com/share/68441cc7-9328-800f-9d2e-62c23765e509.
ChatGPT suggested a small script, run via systemd, that monitors the logs for this error and simply performs a full reboot of the device when it shows up.
It'll have to do for now; hopefully it works...
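For anyone wanting to try the same idea, here is a minimal sketch of such a watchdog. It is not the script from the linked conversation. It assumes a systemd host where `journalctl` is available and the script runs as root, and it only watches for the "Detected Hardware Unit Hang" phrase quoted in this thread. The function names are made up for illustration.

```python
import subprocess

# Phrase the e1000e driver logs when the NIC locks up (as quoted in this thread).
HANG_MARKER = "Detected Hardware Unit Hang"


def is_hang_line(line: str) -> bool:
    """Return True if a kernel log line reports the e1000e unit hang."""
    return HANG_MARKER in line


def watch_and_reboot(dry_run: bool = True) -> None:
    """Follow new kernel messages and reboot on the first unit hang.

    Assumes a systemd host with journalctl, running as root.
    With dry_run=True it only prints what it would do.
    """
    # -k: kernel messages only, -f: follow, -n 0: skip existing lines, -o cat: message text only
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-n", "0", "-o", "cat"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        assert proc.stdout is not None
        for line in proc.stdout:
            if is_hang_line(line):
                print("Unit hang detected:", line.strip())
                if dry_run:
                    print("dry run: would run 'systemctl reboot' now")
                else:
                    subprocess.run(["systemctl", "reboot"], check=False)
                break
    finally:
        # Stop following the journal once we are done.
        proc.terminate()
        proc.wait()
```

`-n 0` means a restarted watcher only reacts to hangs logged after it started, not to one already in the journal. On a box you can only reach remotely, I'd leave `dry_run=True` until you've seen it fire on a real hang.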

Maybe it can help someone else in a similar situation.
 
I have had network problems since I recently switched from a very old Proxmox to v8.4. With unchanged hardware, serious malfunctions occur after a few hours or 1-2 days. An old Intel NIC hangs and apparently also disrupts traffic that has nothing to do with it. For example, VM-to-VM pings stop working even though they don't need the hanging NIC at all.
I'm working around it (ethtool -K ens1f1 gso off gro off [...]), but this can't be a real solution. Did you resolve the issue by buying a new NIC? Which ones are recommended?

THX
 
Please check the syslog. If there are no "Detected Hardware Unit Hang" error messages in it, then you have a different problem and should open your own thread. This one is only about the "e1000 driver hang" problem. The problem seems to affect kernel versions newer than 6.8.12-8-pve. With that kernel everything works fine. I hope this gets fixed in future kernel versions.
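If you'd rather not grep by hand, a small sketch like the one below counts the unit-hang messages and tells you which interface they came from. The regex assumes the message is prefixed with the interface name followed by a colon (e.g. `eno1: Detected Hardware Unit Hang`); only the quoted phrase itself comes from this thread. `/var/log/syslog` is also an assumption, since some installs keep kernel messages only in the journal.

```python
import re
from typing import Iterable

# Assumed shape: "e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:"
HANG_RE = re.compile(r"(?P<iface>\S+): Detected Hardware Unit Hang")


def find_unit_hangs(lines: Iterable[str]) -> list[str]:
    """Return the interface name for every unit-hang message in the given lines."""
    hits = []
    for line in lines:
        m = HANG_RE.search(line)
        if m:
            hits.append(m.group("iface"))
    return hits


def scan_file(path: str = "/var/log/syslog") -> list[str]:
    # On journald-only systems, feed `journalctl -k` output to find_unit_hangs() instead.
    with open(path, encoding="utf-8", errors="replace") as f:
        return find_unit_hangs(f)
```

An empty list from `scan_file()` means you're probably dealing with a different problem than the one discussed here.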
 
I downgraded from 6.8.12-11-pve to 6.8.12-8-pve. The error still happens every once in a while, but on 6.8.12-8-pve it recovers by itself, unlike on 6.8.12-11-pve, where it stays stuck on the Unit Hang error until the server is restarted.
 
I have been battling this problem for some time. I too was getting the network hung issue with kernel versions newer than 6.8.12-8-pve.

I also found that a restart was not necessary to get things going again. I was able to get the network working again by unplugging / plugging the network cable or disabling / enabling the network port remotely in my managed switch.

While I didn't notice any network problems while on kernel 6.8.12-8-pve or earlier, I did notice frequent messages on the console like the following:
[102464.520216] e1000e 0000:00:19.0 eno1: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 7168 ms
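To keep an eye on how often these transmit timeouts show up, a line like the one above can be parsed into its parts. This is a sketch that assumes the exact wording of the single example line in this post; other kernel versions may phrase the watchdog message differently. `TxTimeout` and `parse_watchdog` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Built from the example line quoted above; the wording may vary between kernels.
WATCHDOG_RE = re.compile(
    r"(?P<iface>\S+): NETDEV WATCHDOG: CPU: (?P<cpu>\d+): "
    r"transmit queue (?P<queue>\d+) timed out (?P<ms>\d+) ms"
)


class TxTimeout(NamedTuple):
    iface: str
    cpu: int
    queue: int
    timeout_ms: int


def parse_watchdog(line: str) -> Optional[TxTimeout]:
    """Extract interface, CPU, queue and timeout from a NETDEV WATCHDOG line."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    return TxTimeout(m["iface"], int(m["cpu"]), int(m["queue"]), int(m["ms"]))
```

For the line above this gives interface `eno1`, CPU 1, queue 0 and a 7168 ms timeout.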

I tried disabling the offloading features as recommended in other posts (while on kernel 6.8.12-8-pve) and no longer got these errors. After 3 days of no errors I made the changes persistent and upgraded to 6.8.12-11-pve. It has been 3 days now and I have experienced no network problems!

So it seems the network problems were already showing up with earlier kernels as well; only the way they are handled has changed.

I simply added "pre-up ethtool -K eno1 gso off gro off tso off" on the line after "iface eno1 inet manual" in /etc/network/interfaces so it always runs on startup.
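If you manage several hosts, the same edit can be scripted. The sketch below works on the file's text and inserts the `pre-up` line right after the matching `iface <name> inet manual` stanza, skipping the change if the line is already there. It assumes the stanza is written exactly as in this post and uses a tab for indentation; back up `/etc/network/interfaces` before writing the result back. The function name is hypothetical.

```python
PRE_UP = "pre-up ethtool -K {iface} gso off gro off tso off"


def add_offload_preup(interfaces_text: str, iface: str = "eno1") -> str:
    """Insert the pre-up line after 'iface <iface> inet manual' (idempotent)."""
    stanza = f"iface {iface} inet manual"
    directive = PRE_UP.format(iface=iface)
    lines = interfaces_text.splitlines()
    if any(line.strip() == directive for line in lines):
        return interfaces_text  # already present, nothing to do
    out = []
    for line in lines:
        out.append(line)
        if line.strip() == stanza:
            out.append("\t" + directive)  # indent as an option of that stanza
    return "\n".join(out) + "\n"
```

After a reboot, `ethtool -k eno1` should list the three features as off.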
 
This is not a valid solution for me. Sure, the network no longer hangs with the offloads turned off, but the link drops to FE (Fast Ethernet, 100 Mbit/s).
Has anyone else seen this?
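To confirm whether the link really renegotiated down, you can read the speed the kernel reports in sysfs (`/sys/class/net/<iface>/speed`, in Mbit/s). A minimal sketch follows; it assumes that reading the file can fail or return -1 when the link is down or the speed is unknown, and treats both as "unknown". The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def parse_speed(raw: str) -> Optional[int]:
    """Turn the contents of a sysfs 'speed' file into Mbit/s, or None if unknown."""
    try:
        speed = int(raw.strip())
    except ValueError:
        return None
    return speed if speed > 0 else None  # -1 means unknown on some drivers


def link_speed_mbps(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[int]:
    try:
        raw = (Path(sysfs_root) / iface / "speed").read_text()
    except OSError:
        return None  # no such interface, or the read fails while the link is down
    return parse_speed(raw)


def dropped_to_fast_ethernet(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    speed = link_speed_mbps(iface, sysfs_root)
    return speed is not None and speed <= 100
```

Running `dropped_to_fast_ethernet("eno1")` before and after applying the ethtool change shows whether the offload settings and the speed drop coincide.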
 
Last edited: