e1000 driver hang

For anyone like me who finds this thread and can't get the community script, it's been moved to a URL which is almost but not quite exactly the same as the one posted several times. I'll try to paste it here, but in case this forum restricts links from new users you can take the link that's been posted several times and remove the capital D.

https://community-scripts.github.io/ProxmoxVE/scripts?id=nic-offloading-fix
I came across this problem (again) today and a google search brought me to a post in this thread, ironically one I posted myself a few months ago - I had forgotten I had had this problem before, on another machine!

Thanks for pointing out the correct script.

Unfortunately running it a second time doesn't seem to toggle the feature back off, but since it's been going on for years I suspect that won't matter.
 
Is there a chance to open a support ticket with Proxmox? It seems no one is working on this issue... I thought this would be a bug in a single kernel version, but many kernels have been released since 6.8.12-8-pve and ALL of them have this problem. Upgrading to VE 9 is no solution either.
 
Happy to report that after updating today, all is well.

This log goes back to 13th August 2025. It was hanging every 2 seconds.
My system was super slow, and memory usage would climb until it crashed the system.



Code:
journalctl | grep "Detected Hardware Unit Hang"
Aug 28 19:52:34 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:36 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:38 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:40 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:42 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:44 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:46 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:48 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:50 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:52 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:54 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:56 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:52:58 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:00 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:02 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:04 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:06 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:08 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:10 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:12 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
Aug 28 19:53:14 pve kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:


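If you want to quantify how often it is happening, here is a quick sketch (the date is just the one from my log above; adjust as needed):

Code:
# Count hang events in the kernel log since a given date
journalctl -k --since "2025-08-13" | grep -c "Detected Hardware Unit Hang"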
 
I'm running PVE 9 (9.0.6) with, I believe, kernel 6.14.8-2-pve, and this bit me last night.

Previously (PVE 8), I'd get 4 to 8 of the hang messages and then all would be fine for 4 to 6 months, I think right after a reboot. All of a sudden last night it started spewing the hang messages and the NIC port went dead. Everything was still running; only the onboard NIC was offline. The other NIC (4-port Broadcom, PCI passthrough to a router/firewall VM) was fine.

After reboot things started working again, though it was still complaining about the e1000e hang.

I noticed a kernel update, so I did an upgrade and reboot. After rebooting the message has not (yet?) reappeared.

It looks like 6.14.11-1-pve may have some improvements, at least it hasn't given the message since rebooting.
 
I am still having the issue (sudden dead network connection/host unreachable with e1000e Hardware Unit Hang kernel messages in the log) with 6.14.11-1-pve - the frequency varies by host; on some it's very rare, on others it happens multiple times per day. I can always force a reset from the switch side by just disabling the port and re-enabling it. Then everything works fine again, no restart required on the host. Today one of my hosts did it in the middle of a VM migration, and after I cycled the port from the switch side, the migration even resumed and completed successfully.

I have been adding pre-up ethtool -K eno1 gso off gro off tso off to /etc/network/interfaces on the affected hosts, and so far this seems to stop the hangs.
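For reference, a minimal sketch of what that stanza can look like in /etc/network/interfaces (interface name eno1 assumed; adjust to your NIC and bridge setup):

Code:
auto eno1
iface eno1 inet manual
    # Disable generic/TCP segmentation and generic receive offload on the
    # physical NIC before it comes up, to work around the e1000e hangs
    pre-up ethtool -K eno1 gso off gro off tso off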
 
Wow! No shit? You wrote this post to publish a dirty workaround which was posted here 10,000 times already?
 
Wow! No shit? You wrote this post to publish a dirty workaround which was posted here 10,000 times already?
No? My point was that the latest kernel did not fix the issue (as speculated by the previous poster) - I added one sentence about the workaround to confirm that it still works on the latest kernel. I am not happy with the workaround, but it's the best I can do at this time (I would prefer not getting a new NIC or downgrading the kernel).
 
That's actually a pretty bad workaround, as CPU and bandwidth took a hit. I've lost hope and migrated one of my hosts away from Proxmox. I'm pretty sure this has been asked a million times, but does anyone know if this is on Proxmox's fix list? If not, what's the proper channel to raise this issue with Proxmox?
 
CPU and bandwidth took a hit

I actually have not observed any measurable hit to my CPU utilization or bandwidth after disabling GSO/GRO/TSO - but I only use 1G networking and relatively modern CPUs (Intel 9th gen+). Reading up on what GSO, GRO and TSO do, I don't think any of the benefits would apply to my specific use cases (and I think this is probably true for a lot of users).

What is the situation when you observe CPU and bandwidth impacts from disabling the offloading? What kind of use case?

(And before people start jumping on me again - I do think the current situation, with the default configuration of Proxmox leading to network hangs on common Intel NICs, is a problem that should be addressed. I don't work for Proxmox and use their software for free, so I can't help with that, unfortunately - I am just sharing my observations as data points to help other users in the same boat, trying to explore whatever options are currently available to deal with this problem.)
 
Proxmox can't address the issue; only Intel can fix the driver for their hardware in the upstream vanilla Linux kernel.

While I understand that they are at the mercy of upstream kernel issues, I do think Proxmox could do a few things to make this situation better in the interim ("address" encompasses other possibilities than just "fix"). Apparently an older version of the kernel (and presumably the driver) handled the hangs better, without causing the network interface to freeze - since Proxmox builds their own custom kernel, could they choose to include the older version of the driver? I don't know enough about the other dependencies here to say whether that is feasible. If it is not an option, Proxmox could default the offloading flags to off during installation for e1000e NICs, and/or raise a warning about this issue during installation when an e1000e NIC is detected. In the absence of a working driver from Intel, having offloading turned off by default on e1000e installations (perhaps with a warning, for those who do require offloading, to look for a NIC with better Linux support) would be preferable to the unexpected and unpredictable freezes we have now.

In my situation, I had a working Proxmox cluster for years with a mix of NICs, including Intel e1000e ones, with no apparent issues from PVE 6 through 7 and 8. Everything still seemed fine after upgrading to PVE 9 a month ago. Then I upgraded my (old, second-hand) switch firmware, and that night got the first hang. The hangs then happened randomly but infrequently over the next few weeks - so my first thought unfortunately went to the switch firmware update, and I started investigating there. I tried different switch firmware versions, ports, etc. and had almost resigned myself to replacing the switch, as I was not able to get the hangs to stop. I wasted a lot of time before realizing the issue was coming from the Proxmox side.

But that's just my 2¢
 
I actually have not observed any measurable hit to my CPU utilization or bandwidth after disabling GSO/GRO/TSO - but I only use 1G networking and relatively modern CPUs (Intel 9th gen+). Reading up on what GSO, GRO and TSO do, I don't think any of the benefits would apply to my specific use cases (and I think this is probably true for a lot of users).

What is the situation when you observe CPU and bandwidth impacts from disabling the offloading? What kind of use case?

(And before people start jumping on me again - I do think the current situation, with the default configuration of Proxmox leading to network hangs on common Intel NICs, is a problem that should be addressed. I don't work for Proxmox and use their software for free, so I can't help with that, unfortunately - I am just sharing my observations as data points to help other users in the same boat, trying to explore whatever options are currently available to deal with this problem.)
I also have a 9th gen Intel (i5-9600) and it's used for 24/7 recording at 1080p from 13 cameras and 24/7 live streaming at 480p using Frigate. CPU usage is already high to begin with, and with offloading disabled it's even higher.

Proxmox can't address the issue; only Intel can fix the driver for their hardware in the upstream vanilla Linux kernel.

Here are some upstream bug reports for your leisure :eek:
There is even one from 2015, kernel 5.2.x
I don't have enough knowledge to assume things, but I have the exact same machine running an older version (I believe the 6.14.8-2-pve kernel) and it never had any issues (uptime of at least a year now), so I believe Proxmox could have handled this without having to rely on Intel. However, rolling back to an older kernel from a later version does not seem to work for me; it will still hang sometimes. You can see the user above mentioned that it was also working fine on their older version until they updated to a later version.
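For anyone who wants to try the rollback themselves, a rough sketch of pinning an older kernel on a Proxmox host (the version string here is just an example; check what is actually installed first):

Code:
# List kernels known to the boot tool
proxmox-boot-tool kernel list

# Pin a specific (older) kernel so it is used on every boot
proxmox-boot-tool kernel pin 6.14.8-2-pve

# Remove the pin later to boot the newest installed kernel again
proxmox-boot-tool kernel unpin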
 
I also have a 9th gen Intel (i5-9600) and it's used for 24/7 recording at 1080p from 13 cameras and 24/7 live streaming at 480p using Frigate. CPU usage is already high to begin with, and with offloading disabled it's even higher.
Out of curiosity, did you turn all offloading off (I think the script posted above does that - it turns off checksum offloading and VLAN offloading as well) or just GSO/GRO/TSO? There are mixed recommendations for this - so far in my case, just disabling GSO/GRO/TSO seems to have eliminated the hangs, but I kept the rx/tx checksum and VLAN offloading on.

I would think GSO/GRO/TSO should not really matter for your use case (multiple 1080p streams) - especially if using UDP, but even RTSP over TCP usually uses small packets. I could be wrong, of course, and I don't know your setup - but it could be worth testing.
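In case it helps anyone compare, here is a rough sketch of how to check what is currently enabled and disable only the segmentation offloads (eno1 assumed; this does not persist across reboots on its own):

Code:
# Show the current offload settings for the interface
ethtool -k eno1 | grep -E 'segmentation|generic-receive-offload|checksum'

# Turn off only GSO/GRO/TSO, leaving checksum and VLAN offload as they are
ethtool -K eno1 gso off gro off tso off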
 
Out of curiosity, did you turn all offloading off (I think the script posted above does that - it turns off checksum offloading and VLAN offloading as well) or just GSO/GRO/TSO? There are mixed recommendations for this - so far in my case, just disabling GSO/GRO/TSO seems to have eliminated the hangs, but I kept the rx/tx checksum and VLAN offloading on.

I would think GSO/GRO/TSO should not really matter for your use case (multiple 1080p streams) - especially if using UDP, but even RTSP over TCP usually uses small packets. I could be wrong, of course, and I don't know your setup - but it could be worth testing.
Everything is turned off; I also tried the e1000 community script, with the same result. It might not happen as often as before turning it off, but it will still happen from time to time. I migrated to Debian and it's been good ever since.
 
So... without spending 2 days reading into all of this...

Regarding the errors with the E1000 driver hangs: I may have noticed a few hang-ups here and there over the past 2 years on a server running this driver, but after the last update it was happening daily.

Does the node use this driver? Or is it just the VMs that use it? I changed the network driver on the VMs that were set to E1000 to VirtIO, and I haven't seen any hardware hangs since then. It's been a few hours. We shall see.
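One rough way to tell the two apart (a host NIC named eno1 and a VM with ID 100 are just assumptions here): the host's physical Intel NIC is driven by the kernel's e1000e module, while "E1000" on a VM is only the emulated NIC model in that VM's config.

Code:
# On the host: which kernel driver is bound to the physical NIC?
ethtool -i eno1             # look for "driver: e1000e"

# For a VM: which NIC model is it configured with?
qm config 100 | grep ^net   # e.g. net0: e1000=... or net0: virtio=...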
 
So... without spending 2 days reading into all of this...

Regarding the errors with the E1000 driver hangs: I may have noticed a few hang-ups here and there over the past 2 years on a server running this driver, but after the last update it was happening daily.

Does the node use this driver? Or is it just the VMs that use it? I changed the network driver on the VMs that were set to E1000 to VirtIO, and I haven't seen any hardware hangs since then. It's been a few hours. We shall see.

On my LXC I use VirtIO and the hardware still hangs...

To test the system after a kernel upgrade I use an Ubuntu LXC, and in a terminal I run "speedtest-cli --secure" (just to trigger the bug). Usually after launching the command 2-3 times the hardware hangs with this bug (so I don't have to wait for the hardware to hang randomly).
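A rough sketch of that reproduction loop, with a second terminal on the host watching for the hang message (speedtest-cli assumed to be installed in the container):

Code:
# Inside the Ubuntu LXC: push traffic through the NIC a few times
for i in 1 2 3; do speedtest-cli --secure; done

# On the host, in another terminal: follow kernel messages and watch for the hang
journalctl -kf | grep "Detected Hardware Unit Hang"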
 
Does anyone know why this happened, both initially and why it has reared its head again and again? Not blaming Proxmox per se - it was clearly either fixed or otherwise worked around at some point - but it has just as clearly regressed, resurfaced, or simply never been resolved. I genuinely want to understand why my system worked fine on one update and has this bug now, and what the bug actually is. I follow the discussion in this thread about it being related to hardware offloading, but I can't quite understand why it was fine before and isn't now, when the workload going through the NIC hasn't changed. (Yes, obviously updates change things, including the OS running on that hardware - but that's my point: I'd like to learn!)