[SOLVED] Fixed: Intermittent network dropout with Intel I218-LM NIC -- e1000e hardware offloading bug

Borris1974

New Member
Mar 24, 2026
Hardware and Software
  • Proxmox VE 9.1.6
  • Fujitsu T935
  • Intel Corporation Ethernet Connection (3) I218-LM (rev 03)
  • Kernel driver: e1000e
  • Interface name: nic0 (yours may differ -- typically eno1 or enp3s0)
  • Guest VM: Any (the fault is at the host NIC level)
  • Router: Asus XT9 with DHCP reservation

Symptoms
  • Network connectivity to and from all VMs drops intermittently -- no pattern, could be hours or days between occurrences
  • The Proxmox host itself also loses connectivity during the dropout
  • The physical ethernet link light stays green -- the interface shows as UP in the OS
  • Running ip neigh show dev vmbr0 shows ARP entries going STALE or DELAY for the router
  • No errors visible in the Proxmox web UI
  • Running ping to the router or any external host fails silently
  • SSH sessions drop, Home Assistant becomes unreachable, all VM traffic stops
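
If you want a record of when the neighbour entry for the router changes state, a small helper like the one below can be run from cron or a loop. This is only a sketch: the function name neigh_state and the gateway address 192.168.50.1 are placeholders I made up, so substitute your own router IP and bridge name.

```shell
#!/bin/sh
# Print the neighbour-table state (REACHABLE, STALE, DELAY, FAILED, ...)
# for one IP address. Reads `ip neigh show` output on stdin; the state is
# the last field of the matching line.
neigh_state() {
    awk -v ip="$1" '$1 == ip { print $NF }'
}

# Usage on the host (192.168.50.1 is a placeholder gateway address):
#   ip neigh show dev vmbr0 | neigh_state 192.168.50.1
```

Logging the result with a timestamp next to a ping result gives you something concrete to compare against the kernel log later.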

Temporary fix (what most people discover first)
Physically unplug and replug the ethernet cable. Connectivity restores within seconds.
This works because unplugging the cable forces a hardware reset of the NIC, clearing the hung state. It is not a fix -- the dropout will return.


Root cause
The Intel I218-LM NIC uses the e1000e Linux kernel driver. There is a well-documented bug where hardware offloading features cause the NIC to enter a silent hang state. The kernel logs this as:

e1000e 0000:00:19.0 nic0: Detected Hardware Unit Hang

Check for this after a dropout with:
dmesg | grep -i "hang\|e1000e" | tail -20

The NIC continues to report itself as UP and the link light stays on, which makes this extremely difficult to diagnose. The ARP table going stale is a symptom of the underlying NIC hang, not the root cause.
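
To turn the dmesg check into a simple yes/no answer, something like the following counts the hang messages. The function name count_unit_hangs is my own, and the journalctl line in the comment assumes a persistent journal is enabled on your host.

```shell
#!/bin/sh
# Count "Detected Hardware Unit Hang" lines in kernel log text read from stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean for use in scripts.
count_unit_hangs() {
    grep -c 'Detected Hardware Unit Hang' || true
}

# Usage:
#   dmesg | count_unit_hangs
#   journalctl -k -b -1 --no-pager | count_unit_hangs   # previous boot
```

A non-zero count after a dropout is a strong hint that you are looking at this bug rather than a router problem.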

This affects multiple Intel NIC models using the e1000e driver, including I217-LM, I218-LM, I219-LM and I219-V. It is not specific to any particular VM or workload.


Permanent fix

Step 1 -- Identify your physical NIC name

lspci | grep -i ethernet
ip link show

Note the interface that is the bridge port for vmbr0.
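
If the ip link output is long, a rough filter like this lists only the interfaces attached to a given bridge. It is a sketch with a made-up function name (bridge_members). Note that it will also list the tap/veth ports of any running guests, so the physical NIC is the one that also shows driver e1000e in ethtool -i.

```shell
#!/bin/sh
# List interfaces enslaved to a bridge, reading `ip -o link show` output
# on stdin. Each member line contains "master <bridge>".
bridge_members() {
    awk -v br="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == "master" && $(i + 1) == br) {
                name = $2
                sub(/:$/, "", name)    # strip the trailing colon
                sub(/@.*$/, "", name)  # strip any "@parent" suffix
                print name
            }
    }'
}

# Usage:
#   ip -o link show | bridge_members vmbr0
#   ethtool -i nic0 | grep '^driver'    # should report e1000e
```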

Step 2 -- Check offloading is currently enabled (confirms the issue applies to you)
ethtool -k nic0 | grep -E 'tcp-seg|generic-seg|generic-receive|rx-vlan|tx-vlan|scatter'

If any entries show on, proceed.

Step 3 -- Disable offloading immediately (temporary, to test)
Replace nic0 with your interface name:

ethtool -K nic0 gso off tso off rxvlan off txvlan off gro off tx off rx off sg off

Step 4 -- Make it permanent
Edit /etc/network/interfaces:
nano /etc/network/interfaces

Add a post-up line to your physical NIC stanza. The post-up method is required -- the offload-* directives do not reliably apply on boot:
Code:
auto lo
iface lo inet loopback

iface nic0 inet manual
    post-up ethtool -K nic0 gso off tso off rxvlan off txvlan off gro off tx off rx off sg off

auto vmbr0
iface vmbr0 inet dhcp
    bridge-ports nic0
    bridge-stp off
    bridge-fd 0

Reload networking:
ifreload -a
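
It is easy to paste the post-up line under the wrong stanza (for example under vmbr0 instead of the physical NIC). As a quick sanity check, a helper along these lines confirms that the stanza for a given interface contains a post-up ethtool -K line. The function name has_offload_postup is hypothetical and the parsing is deliberately simple, so treat it as a sketch rather than a full ifupdown parser.

```shell
#!/bin/sh
# Succeed (exit 0) if the "iface <nic> ..." stanza in an ifupdown-style
# file has a post-up line that calls "ethtool -K".
# $1 = interface name, $2 = file (defaults to /etc/network/interfaces).
has_offload_postup() {
    awk -v nic="$1" '
        $1 == "iface"                  { current = $2 }  # a new stanza begins
        $1 == "auto" || $1 ~ /^allow-/ { current = "" }  # previous stanza ended
        current == nic && $1 == "post-up" && /ethtool -K/ { found = 1 }
        END { exit !found }
    ' "${2:-/etc/network/interfaces}"
}

# Usage:
#   has_offload_postup nic0 && echo "post-up line present for nic0"
```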

Verify offloading is off after reload:
ethtool -k nic0 | grep -E 'tcp-seg|generic-seg|generic-receive|rx-vlan|tx-vlan|scatter'

All entries should show off.
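
If you would rather not eyeball the list after every reboot, the same grep can be extended to print only the features that are still on. This is a sketch and offloads_still_on is a name I chose. Empty output means everything in the filtered list is off.

```shell
#!/bin/sh
# Read `ethtool -k <nic>` output on stdin and print the names of the
# offload features from the filtered list that still report "on".
offloads_still_on() {
    grep -E 'tcp-seg|generic-seg|generic-receive|rx-vlan|tx-vlan|scatter' |
        awk '$2 == "on" { sub(/:$/, "", $1); print $1 }'
}

# Usage:
#   ethtool -k nic0 | offloads_still_on
```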

Step 5 -- Fix invalid ARP responses (secondary fix)
Proxmox can also send ARP replies on the wrong interface, confusing the router. Prevent this permanently:
echo -e "net.ipv4.conf.all.arp_ignore=2\nnet.ipv4.conf.all.arp_announce=2" | tee /etc/sysctl.d/99-proxmox-arp.conf
sysctl -p /etc/sysctl.d/99-proxmox-arp.conf
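
To confirm the two ARP settings are actually live once the file has been written and loaded, a check like this can be piped from sysctl. The function name arp_sysctls_applied is my own invention, and it only checks for the value 2 used in this post.

```shell
#!/bin/sh
# Read "key = value" lines (as printed by `sysctl key1 key2`) on stdin and
# succeed only if both ARP settings from this post are present and set to 2.
arp_sysctls_applied() {
    awk -F' *= *' '
        $1 == "net.ipv4.conf.all.arp_ignore"   && $2 == 2 { ignore = 1 }
        $1 == "net.ipv4.conf.all.arp_announce" && $2 == 2 { announce = 1 }
        END { exit !(ignore && announce) }
    '
}

# Usage:
#   sysctl net.ipv4.conf.all.arp_ignore net.ipv4.conf.all.arp_announce \
#       | arp_sysctls_applied && echo "ARP settings applied"
```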


Result
No further network dropouts. The fix survives reboots. No performance impact was observed on a home lab running Home Assistant and other lightweight VMs.


Important AI tools warning: conflicting and wrong diagnoses -- use multiple tools and verify everything

This section is worth reading before you spend hours chasing the wrong fix.

When the symptoms of this fault were put into Microsoft Copilot, it was adamant that the router was the cause. It pointed to the DHCP reservation, the ASUS firmware, and ARP staleness on the router side as the fault. Even after being told that the problem had been fixed with the steps above, Copilot continued to insist the router was at fault and suggested router-side fixes.

This is a known risk with AI assistants -- they can latch onto a plausible-sounding diagnosis early and then defend it even when contradicting evidence is provided. In networking faults especially, where symptoms like ARP staleness and dropout can have many different root causes, this kind of confirmation bias in an AI response can send you in completely the wrong direction and waste a significant amount of time.

The fault was ultimately diagnosed correctly by using Claude (Anthropic) to analyse the raw ARP table output, the interface names, the NIC hardware details, and the specific symptom of cable-replug restoring connectivity. That combination of clues pointed specifically to the e1000e hardware offloading hang rather than any router or ARP configuration issue.

The lesson here is practical:
  • Do not rely on a single AI tool for complex technical diagnosis
  • Provide raw command output rather than describing symptoms in plain language -- AI tools reason much more accurately from actual data
  • If an AI diagnosis does not match what you are observing, try a different tool with the same data
  • Cross-reference any AI suggestion against the relevant community forums (in this case the Proxmox forum, where this exact bug is documented across multiple threads)
  • AI tools are genuinely useful for this kind of diagnosis but they are not infallible -- treat their output as a starting point for investigation, not a final answer
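
For the "provide raw command output" point, one way to gather the relevant data in a single paste-able file is sketched below. The function names (diag_section, collect_diag) and the report filename are made up for illustration, and the commands are simply the ones used earlier in this post plus ethtool -i and ip -s link. Commands that fail or are missing are skipped rather than aborting the run.

```shell
#!/bin/sh
# Print a header, then run the given command, keeping its error output
# and carrying on even if the command fails.
diag_section() {
    printf '== %s ==\n' "$*"
    "$@" 2>&1 || true
}

# Collect the raw outputs discussed in this post.
# $1 = physical NIC (e.g. nic0), $2 = bridge (e.g. vmbr0).
collect_diag() {
    nic="$1"
    bridge="$2"
    diag_section ip neigh show dev "$bridge"
    diag_section ip -s link show "$nic"
    diag_section ethtool -i "$nic"
    diag_section ethtool -k "$nic"
    printf '== kernel log (hang / e1000e lines) ==\n'
    { dmesg 2>&1 | grep -i 'hang\|e1000e' | tail -20; } || true
}

# Usage (ideally run during or right after a dropout):
#   collect_diag nic0 vmbr0 > nic-dropout-report.txt
```
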
In this case the correct fix was found, confirmed, and is now running stably. But the wrong diagnosis from one AI tool could easily have led to unnecessary router replacements, firmware changes, or hours of network reconfiguration that would have had no effect whatsoever.