[SOLVED] Intel NIC e1000e hardware unit hang

^ Out of curiosity, neponn, can you post your Hardware Unit Hang error from today so we can see whether it's any different from before?

I'm not seeing any documentation about the MAC or PHY statuses, but with enough information it might help lead us down the right path. I'm wondering if even one bit in the MAC or PHY status can indicate which settings need to change.
Sorry for the delay. Here is the second bout of error messages; it looks the same to me:

Code:
Oct 15 05:39:44 proxmox kernel: e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
                                  TDH                  <ee>
                                  TDT                  <cd>
                                  next_to_use          <cd>
                                  next_to_clean        <ed>
                                buffer_info[next_to_clean]:
                                  time_stamp           <107b87fd0>
                                  next_to_watch        <ee>
                                  jiffies              <107b88b00>
                                  next_to_watch.status <0>
                                MAC Status             <40080083>
                                PHY Status             <796d>
                                PHY 1000BASE-T Status  <3c00>
                                PHY Extended Status    <3000>
                                PCI Status             <10>
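
On the question of whether any bit in the MAC or PHY status tells us something: as far as I can tell from the e1000e driver source, "PHY Status" is the standard MII basic status register (BMSR) and "MAC Status" is the controller's STATUS register. Below is a rough sketch that decodes only the bits I'm confident about (link, autonegotiation, remote fault, duplex, speed). The function names are my own, and the remaining bits (for example the 0x40000000 in the MAC status) are left undecoded.

```shell
#!/bin/sh
# Decode a few well-known bits from an e1000e hang report.
# Assumption: "PHY Status" is the MII BMSR, "MAC Status" is the STATUS register.

# $1 = PHY Status value, e.g. 0x796d
decode_phy_status() {
    v=$(( $1 ))
    link=down; aneg=no; fault=no
    [ $(( v & 0x0004 )) -ne 0 ] && link=up      # BMSR link status
    [ $(( v & 0x0020 )) -ne 0 ] && aneg=yes     # autonegotiation complete
    [ $(( v & 0x0010 )) -ne 0 ] && fault=yes    # remote fault
    echo "link=$link autoneg_complete=$aneg remote_fault=$fault"
}

# $1 = MAC Status value, e.g. 0x40080083
decode_mac_status() {
    v=$(( $1 ))
    link=down; duplex=half
    [ $(( v & 0x2 )) -ne 0 ] && link=up         # LU: link up
    [ $(( v & 0x1 )) -ne 0 ] && duplex=full     # FD: full duplex
    case $(( v & 0xC0 )) in                     # speed field
        0)       speed=10 ;;
        64)      speed=100 ;;
        128|192) speed=1000 ;;
    esac
    echo "link=$link duplex=$duplex speed=$speed"
}

# $1 = time_stamp, $2 = jiffies; prints how many jiffies the buffer has waited
stall_jiffies() {
    echo $(( $2 - $1 ))
}

decode_phy_status 0x796d
decode_mac_status 0x40080083
stall_jiffies 0x107b87fd0 0x107b88b00
```

For the report above this gives link up, autonegotiation complete, 1000 Mbps full duplex, and a gap of 2864 jiffies between time_stamp and jiffies. So the link itself looks healthy and the problem seems confined to the TX ring (next_to_watch.status is 0, i.e. the descriptor was never marked done). How long 2864 jiffies is in seconds depends on the kernel's HZ setting, which the log doesn't show.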

Great to hear that you have had success in changing to the e1000e driver. I found that using tso off gso off gro off (still with virtio drivers) gave me a few more days of stability (and no hangs). But then I chickened out and switched to the Realtek RTL8153 based USB adapter I mentioned above. It performs just as well as the internal NIC and, so far, no hangs.
 
Humph. Another hang. Once again, not much network activity at the time. So tso off isn't enough. Now trying tso off gso off gro off. USB ethernet adapter on the way (Axagon ADE-SR).

Out of curiosity - has anyone found a way to reset the hung e1000 controller without rebooting the system? Wondering whether a supervisor script could be developed that detected a hang and took action to reinstate the controller? Given how infrequently hangs occur, this could be workable...
I managed to reset the hung controller by just unplugging the Ethernet cable and plugging it back in after a few seconds. The network comes back up after another few seconds.
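
For anyone who wants to try the supervisor-script idea, here is a minimal sketch. It follows the kernel log, and when it sees the hang message for the interface it takes the link down and up again with `ip link`. Whether a software link bounce clears the hang the way a physical replug does is an assumption I have not verified. The script name, the `IFACE` and `COOLDOWN` variables and the `run` argument are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical e1000e hang watchdog (sketch only, untested on real hangs).
IFACE="${IFACE:-eno1}"        # interface name as it appears in the log
COOLDOWN="${COOLDOWN:-60}"    # minimum seconds between two link bounces

# Succeeds if log line $2 is a hang report for interface $1.
is_hang_line() {
    case "$2" in
        *"e1000e"*" $1: Detected Hardware Unit Hang"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Software stand-in for unplugging and replugging the cable (assumption).
bounce_link() {
    ip link set dev "$1" down
    sleep 5
    ip link set dev "$1" up
}

# Only start watching when called as: ./e1000e-watchdog.sh run
if [ "${1:-}" = "run" ]; then
    last=0
    # -k: kernel messages, -f: follow, -n 0: skip messages already logged
    journalctl -k -f -n 0 | while read -r line; do
        is_hang_line "$IFACE" "$line" || continue
        now=$(date +%s)
        # The driver repeats the message while hung; avoid bouncing each time.
        [ $(( now - last )) -lt "$COOLDOWN" ] && continue
        logger -t e1000e-watchdog "hang detected on $IFACE, bouncing link"
        bounce_link "$IFACE"
        last=$(date +%s)
    done
fi
```

It needs root for `ip link set`, and on a Proxmox host bouncing the physical NIC under a bridge will briefly interrupt the VMs as well, so treat it as a last resort behind the tso off workaround.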

Just happened today and last happened 5 days ago. Going to try a few workarounds suggested here and observe.
 
Just a tip, you only need tso off.
There's no need to turn off all others.
I've been using this on /etc/network/interfaces for over 2 years with no issues:

Code:
post-up ethtool -K eno1 tso off
Code:
post-up ethtool -K vmbr0 tso off
Glad that's been working for you. I just had a similar scenario in my lab: I rebuilt my main node, hit the issue, and had completely forgotten about this bug. I hope this gets me around this stupid issue.

Thank you!
 
Just came here to say that the following fixed my e1000e intel nic hanging issue. Both of the following have successfully worked around this stupid issue:

* Switch my heavy-usage VMs to the emulated e1000e NIC at the VM level, instead of virtio (limited testing, but it worked for about 24 hours, where previous crashes were every ~hour). Choose your VM > Hardware > Network Devices > go to the "Model" drop-down > select e1000e
-OR-
* Disable tcp-segmentation-offload on the physical Intel NICs (I did not do the vmbr interface). This is what I stuck with, not the VM driver change above.

Here is what my /etc/network/interfaces file looks like:
Code:
iface nic0 inet manual

        post-up ethtool -K nic0 tso off


iface enp2s0 inet manual

        post-up ethtool -K enp2s0 tso off


<iface vmbr0 unmodified>

To confirm the workaround is in place, this should display off, like in this example:
Code:
root@hostXX:~# ethtool -k nic0 | grep tcp-segmentation-offload
tcp-segmentation-offload: off
root@hostXX:~#
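
If you have more than one NIC to check, the same test can be wrapped in a small loop. This is only a sketch: the helper name `tso_state` is my own, and it assumes `ethtool -k` prints the feature on a line of the form `tcp-segmentation-offload: off`, as in the output above.

```shell
#!/bin/sh
# Report the TSO state of every interface given on the command line.

# Reads `ethtool -k` output on stdin and prints "on" or "off".
tso_state() {
    awk -F': *' '$1 == "tcp-segmentation-offload" {
        split($2, a, " ")   # drop trailing notes such as "[fixed]"
        print a[1]
        exit
    }'
}

for ifc in "$@"; do
    state=$(ethtool -k "$ifc" | tso_state)
    echo "$ifc: tcp-segmentation-offload $state"
done
```

For my setup I would call it with `nic0 enp2s0` as arguments; every line should end in off after a reboot or an ifreload if the post-up lines took effect.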

This has been working for me for 48 hours, but I will report back if it stops working, and I need to make any other changes.

I'm using Proxmox 9.1.1 on a Dell inspiron 7010 MFF with an i5-13500T processor, a single onboard gigabit NIC (nic0), and an add-on "youyeetoo" M.2 AE Key card. The add-on card hadn't actually had any problems, but I went ahead and applied the workaround to it anyway, since it uses the same driver and I was frustrated.

I saw a comment earlier about someone only noticing it on NICs which share connectivity with an AMT KVM system. My inspiron nic0 does indeed have AMT KVM configured. So at some point, if I feel like experimenting, I might re-enable TCP segmentation offload on my M.2 AE card and see if the problem surfaces for that M.2 "intel" NIC under load. And if I get really curious, I may disable AMT KVM on nic0 and see if the problem resurfaces there. But probably not. So hopefully someone else feels like experimenting with AMT KVM enabled vs disabled, and they can report back :)
 