[SOLVED] Intel NIC e1000e hardware unit hang

So I'm planning to rebuild my Home Assistant server over the weekend, using Proxmox so that I can run HAOS in a VM and be considered "supported" by HA (today I run HA Supervised, but it's fallen out of favour for prod builds). My HA server is an Intel NUC (NUC8i3BEH, 8th Gen Core i3). The NIC shows up as:

Code:
% lspci -v | grep Ethernet
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (6) I219-V (rev 30)
    Subsystem: Intel Corporation Ethernet Connection (6) I219-V

I assume, therefore, that I am going to join the club of e1000e driver hangs, which doesn't fill me with joy (the NIC has been 100% stable running bare metal for the last 6 years).

My question: can I avoid this grief if I use the VirtIO network driver instead of Intel E1000? I'm not doing anything fancy with networking - no VLANs or weird IP config. I'll just have separate IP addresses for the host and each of the VM guests. I do have Gbps fibre from my ISP, so would like to maintain full Gbps performance.
That is a very good question, and my testing so far suggests the conventional wisdom that led me here is backwards. Assigning the VM an e1000e NIC made all the problems go away for me. I'll do further testing later, but the VirtIO option seems to be what was causing the crashes. I did disable TSO as well, though, so I'm curious whether you'd have no problems just using the e1000e NIC without any further changes. Please let us know!
 
OK - that's super interesting. And I've twigged that perhaps I've missed the point a bit. I was thinking of this as a problem that materialised on VM guests, but now I'm thinking the workaround (disabling GSO / TSO) is probably applied to the host, not the guests. This means the problem must be a complex interaction between config of the interface on the host and network driver chosen for the guests. Ouch.
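(For anyone following along, here's a quick sketch of checking and toggling the offloads on the host side; the interface name eno1 is an assumption, check `ip link` for yours.)

```shell
# Show the current offload state on the host NIC (eno1 is an assumed name):
ethtool -k eno1 | grep -E 'segmentation-offload|generic-receive-offload'

# Turn TSO/GSO/GRO off at runtime; this does not survive a reboot,
# so a permanent fix belongs in a post-up line in /etc/network/interfaces:
ethtool -K eno1 tso off gso off gro off
```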

In any case, I'll certainly share my experience. Although I have to confess I am considering alternatives to Proxmox, if it means I can avoid this issue. It occurs to me that Proxmox is arguably far more capable than I need (two VM guests with super-vanilla config).
 
Don't let me talk you out of it. Two hours of iperf3 so far without issues. And just now I edited the config file to match what the last poster suggested, or at least something similar, and I'm going to test it again. I'm now testing:


Code:
auto lo
iface lo inet loopback

iface eno1 inet manual
    # Disable offloads for e1000e driver issues
    # post-up /sbin/ethtool -K eno1 tso off gso off gro off
    post-up ethtool -K eno1 tso off

auto vmbr0
iface vmbr0 inet static
    address 10.10.10.1/24
    gateway 10.10.10.254    # gateway must be inside the bridge's subnet; 192.168.1.1 would be unreachable from 10.10.10.1/24
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    post-up ethtool -K vmbr0 tso off

# iface wls3f3 inet manual
# Wireless interface (Wi-Fi client)
auto wls3f3
iface wls3f3 inet dhcp
    wpa-ssid "backup"
    wpa-psk "backup"


source /etc/network/interfaces.d/*

and I'm not experiencing the same issues so far. I'll let iperf3 run overnight and see what happens.
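If anyone wants to reproduce the soak test, this is roughly what I'm running (the server address 192.168.1.50 is just a placeholder for another box on the LAN):

```shell
# On another machine on the LAN (not the Proxmox host), start the server side:
#   iperf3 -s
# On the machine under test, run a 10-hour client soak and log the results
# (192.168.1.50 is a placeholder for the iperf3 server's address):
iperf3 -c 192.168.1.50 -t 36000 --logfile iperf3-soak.log
```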
 
I am stable with all updates installed. At least for the moment. I am using the offloading fix.
But to be sure, I bought two of these gadgets: https://store-eu.gl-inet.com/products/comet-gl-rm1-remote-keyboard-video-mouse
(They have a global and a US store)
Nice, but can we be certain there is no hidden backdoor (even without using the cloud option)? Is the software open source, so the community can audit it and you can compile it yourself? If I were to use this, I would lock it down extremely tight: no traffic to the internet, for instance, and I would make sure on my router/firewall/gateway that it only responds to a few internal IP addresses, including a few that get handed out when I connect to my network over VPN (for remote access). Opening up physical access to the outside world is a really, really, REALLY creepy thing, however nice it is.

And even locked down, what if I connect using a web interface, and that web interface running on the client machine also connects to the outside world through whatever is on the page, getting around my nicely locked-down setup?
 
Just to update you guys.

  1. The Ubuntu VM with an e1000e NIC successfully ran iperf3 for 10 hours at gigabit speeds with 0 retransmissions and no crashing.
  2. The Windows 10 VM with a VirtIO NIC has been running iperf3 for 8 hours at gigabit speeds. No stats until it's done.
After this VM finishes its 10-hour iperf3 test, I'll do another test with multiple VMs running iperf3 at the same time, as well as the Proxmox host. My best settings so far are the ones I last posted in this thread, plus whatever miscellaneous changes I may have made elsewhere.
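For the concurrent test, iperf3 needs one server listener per client, so something like this on the server box (the ports and the 192.168.1.50 address are placeholders):

```shell
# One iperf3 listener per concurrent client, each on its own port:
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &
iperf3 -s -p 5203 &

# Then each VM plus the Proxmox host targets its own port, e.g.:
#   VM 1:  iperf3 -c 192.168.1.50 -p 5201 -t 36000
#   VM 2:  iperf3 -c 192.168.1.50 -p 5202 -t 36000
#   host:  iperf3 -c 192.168.1.50 -p 5203 -t 36000
```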

For those who hit this issue before: other than just waiting and seeing whether the problem comes back, would you recommend any tests or benchmarks that reliably trigger the hardware hang? How can I test it before I put it into a production environment, so to speak? The Windows 10 VM with a VirtIO NIC running iperf3 used to crash it reliably, but after the changes, so far so good. I might call it done after a few more days of testing.
 
I would certainly lock it down HARD too.

What makes that different from all the existing Server IPMI's thats already on most decent server MoBo's ?
IMHO - Browser control is mostly on you.

That said:
I have been eyeballing this one a bit, and they just released a PoE model too.
But I think they're a bit pricey for their functionality.
The one mentioned above has a "small dataflash", which might not be big enough to hold an install ISO.
The PoE model has a bigger, "OK" dataflash, but the price went way up... considering we're only talking about a ~$6-10 increase in hardware cost.
 
Might running iperf3 on PVE itself concurrently with the VM(s) be a good test? That would cover the client VMs running network traffic at the same time as, for instance, backups of the clients running on the host. Also, running iperf3 in both directions?
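Both directions are easy to cover with iperf3's flags (the server address is again a placeholder; --bidir needs iperf3 3.7 or newer):

```shell
# Normal run: client sends, server receives (upload path):
iperf3 -c 192.168.1.50 -t 3600
# Reverse mode: server sends, client receives (download path):
iperf3 -c 192.168.1.50 -R -t 3600
# Both directions simultaneously (requires iperf3 >= 3.7):
iperf3 -c 192.168.1.50 --bidir -t 3600
```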

(I also ran into this quite suddenly earlier this year, after using PVE since 2022 with a very simple setup: single PVE+PBS host, single Ubuntu VM (lshw says I am running the VirtIO network device). I haven't had the problem any more since turning the hardware offloads off with "post-up /sbin/ethtool -K eno1 tso off gso off" on the PVE host.)
 
During the extended test, iperf3 on the Windows VM ran into an issue, but Proxmox and the VMs are all still responsive.

On the win10 client with a VirtIO nic I got:
iperf3: error - control socket has closed unexpectedly

On the server, I got (remember: not the Proxmox server, but a different server on the network):
iperf3: error - select failed: Bad file descriptor

So far this has mostly solved the issue. I'll keep trying to see whether VirtIO can pass a 10+ hour iperf3 test. I might end up sticking with e1000e NIC assignments in Proxmox; we'll see.
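For anyone wanting to flip the NIC model without the GUI, it can be done with qm on the Proxmox host (VMID 100 and bridge vmbr0 are placeholders; depending on your Proxmox version the model keyword may be e1000 or e1000e):

```shell
# Give VM 100 an emulated Intel NIC instead of VirtIO:
qm set 100 --net0 e1000,bridge=vmbr0
# Or switch back to the paravirtualised VirtIO NIC:
qm set 100 --net0 virtio,bridge=vmbr0
# Verify what's configured:
qm config 100 | grep net0
```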
 
Update: iperf3 completed on the Windows VM with a VirtIO NIC. The first failure must've been a fluke.

After the change, I have not been able to get it to fail as quickly or in the same way as before. I have not yet experienced a "hardware hang" on the proxmox host in the same way.

In the Proxmox shell, when I search:
Code:
journalctl -p err --since "2 days ago" -r | grep -Ei "hang|stuck|blocked|lockup|timeout"
I'm now greeted with no results over the last 2 days.
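To catch a hang the moment it happens instead of searching afterwards, the kernel log can be followed live; the grep pattern from above also matches the classic e1000e message (the sample log line below is made up for illustration):

```shell
# Follow kernel messages live and flag the telltale e1000e message:
#   journalctl -kf | grep -i "hardware unit hang"

# Demo that the search pattern used above catches it, against a sample line:
printf 'e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang\n' \
  | grep -Ei 'hang|stuck|blocked|lockup|timeout'
# → prints the sample line (grep exits 0 on a match)
```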

So, it's looking hopeful. Firing up multiple iperf3 instances across virtual machines now.