The Proxmox 6.8.12-9-pve kernel has introduced a problem with the e1000e driver: the network connection is lost after some hours.

6.8.12-11-pve was released for both PVE and PBS; there were some changes to the ABI:
https://git.proxmox.com/?p=pve-kernel.git;a=shortlog;h=refs/heads/bookworm-6.8

ABI stands for Application Binary Interface. It defines the low-level interface between the kernel and its modules (such as device drivers), specifying how compiled code interacts with the kernel at the binary level. This includes details like register usage, memory layout, calling conventions, and symbol versions of exported kernel functions and variables.
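For example, you can compare the running kernel's ABI with the version a module was built against (a quick illustration; output will differ per system):
Code:
# ABI/version of the running kernel
uname -r
# kernel version the currently installed e1000e module was built for
modinfo e1000e | grep vermagic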
 
I still have this issue with kernel 6.8.12-11-pve as well. Will this be fixed in a future kernel, or do I have to apply the ethtool workaround?
 
For what it's worth (and it may help in the debugging process): I have three systems in my home lab with Intel I217/I219 NICs.

Only the I217-LM (rev 04) is affected by freezes. The other systems with the I219-V (rev 21) and I219-V (rev 31) are working fine with all kernels up to 6.8.12-10-pve (I have not tested -11 so far).

The main production servers with Intel X710 and X550 NICs are also working fine with kernels -8, -9, and -10.
 
We have a few machines with `Intel Corporation Ethernet Connection (17) I219-LM (rev 11)` that are having this issue on the 6.8.12-11-pve kernel.
Thanks for letting us know. We should wait for a fix in a future kernel. We hope to have someone from Proxmox looking into this soon.
Meanwhile, we are still running the pinned 6.8.12-8.
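For anyone who wants to do the same, pinning can be done with proxmox-boot-tool (a sketch; check the exact version string with the list command first):
Code:
# show installed kernels and the current pin
proxmox-boot-tool kernel list
# keep booting the known-good kernel
proxmox-boot-tool kernel pin 6.8.12-8-pve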
 
I also have this issue: all my LXCs lose connection at sporadic intervals, but always many hours apart. I have Uptime Kuma set up on a different server to monitor this for me, and it shows the outages.

Network card:

07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)

PVE:

proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
 
I upgraded my system on Tuesday from 6.8.12-10-pve to 6.8.12-11-pve and it is still online, so yesterday I updated another system, which is technically identical. That one froze just a few hours after the upgrade. The main difference is that the second one is also running PBS in a PVE VM.
 
Is it possible that this very serious bug is still present on a network card this common, used in millions of PCs? Absurd.
 
I forgot to mention that I had implemented the offloading config in /etc/network/interfaces on my system. So that did help here.

[Screenshot of the offloading config in /etc/network/interfaces]
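For reference, the stanza looks roughly like this (a sketch mirroring the fix quoted later in this thread; interface names and the exact set of disabled features may differ on your system):
Code:
iface vmbr0 inet static
    ...
    bridge-ports enp0s31f6
    post-up ethtool -K enp0s31f6 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off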
 
Is it possible that this very serious bug is still present on a network card this common, used in millions of PCs? Absurd.

Yeah, normally the Proxmox team and upstream are pretty good about this, but this is kinda nuts. It's been a couple of months now.
 
Yeah, normally the Proxmox team and upstream are pretty good about this, but this is kinda nuts. It's been a couple of months now.
Sadly, it's not months but years. This thread has existed since September 2019.
 
I have similar problems: my Proxmox host occasionally drops off the network. It had been stable for over a year, but this just started happening now, so I guess it is related to an upgrade, as mentioned earlier in this thread. I have to unplug and replug the Ethernet cable to get it back.
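(For what it's worth, a software link bounce sometimes brings the interface back without touching the cable; this may be worth trying before a physical replug:)
Code:
ip link set enp0s31f6 down && ip link set enp0s31f6 up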

I am running the 6.8.12-11-pve kernel.

This is what I get when I run ethtool -i enp0s31f6:
Code:
driver: e1000e
version: 6.8.12-11-pve
firmware-version: 2.3-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

It just happened again, and the logs report a Hardware Unit Hang:
dmesg | tail -100

Code:
MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292707.471193] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116dd941>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292709.455165] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116de101>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292711.439132] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116de8c1>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292713.486122] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116df0c0>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292715.470168] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116df880>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292717.454083] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116e0040>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292718.471934] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
[292718.558360] vmbr0: port 1(enp0s31f6) entered disabled state
[292726.136923] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[292726.136966] vmbr0: port 1(enp0s31f6) entered blocking state
[292726.136974] vmbr0: port 1(enp0s31f6) entered forwarding state


journalctl --since "10 minutes ago" --no-pager | grep -Ei 'network|link|enp0s31f6|vmbr0|e1000e'

Code:
Jun 05 20:31:37 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:31:39 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
...
Jun 05 20:36:45 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:36:47 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:36:48 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
Jun 05 20:36:48 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered disabled state
Jun 05 20:36:56 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jun 05 20:36:56 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered blocking state
Jun 05 20:36:56 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered forwarding state
Jun 05 20:37:43 pve-acer-veriton systemd[1252867]: Listening on dirmngr.socket - GnuPG network certificate management daemon.

ip -s link show enp0s31f6

Code:
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether d4:61:37:01:c8:33 brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast           
     35711121667  38257968      0    9749    2413 1347507
    TX:    bytes   packets errors dropped carrier collsns           
    221063815294 157587082      0       0       0       0

After consulting ChatGPT, I did the following:

Disabled Energy Efficient Ethernet (EEE)
EEE can apparently cause link flapping or power-saving quirks.

Created a /etc/systemd/system/disable-eee.service file.

Code:
[Unit]
Description=Disable EEE on enp0s31f6
After=network.target

[Service]
ExecStart=/sbin/ethtool --set-eee enp0s31f6 eee off
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Activated it with:
Code:
systemctl daemon-reload
systemctl enable --now disable-eee.service
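You can verify afterwards that EEE is actually off (assuming the NIC and driver report EEE state):
Code:
ethtool --show-eee enp0s31f6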

Tuned e1000e driver settings
Created /etc/modprobe.d/e1000e.conf and filled it with:
Code:
options e1000e InterruptThrottleRate=0,0 RxIntDelay=0 TxIntDelay=0
options e1000e enable_eee=0

Applied those changes:
Code:
update-initramfs -u -k all
reboot
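After the reboot, a quick sanity check that the options took effect is to read the module parameters from sysfs (paths as I understand them; verify on your own system):
Code:
cat /sys/module/e1000e/parameters/InterruptThrottleRate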

That seemed to work: my host was stable for two days, but today it acted up again. I then found this post and applied the ethtool fix suggested here, putting it in /etc/network/interfaces as such:

Code:
iface vmbr0 inet static
    address 192.168.X.X/24
    gateway 192.168.X.X
    bridge-ports enp0s31f6
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    post-up ethtool -K enp0s31f6 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
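To apply the change without a full reboot (assuming ifupdown2, which PVE 8 uses by default) and to verify the offloads are really off:
Code:
ifreload -a
ethtool -k enp0s31f6 | grep -E 'offload|scatter-gather'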

I guess all I can do now is wait and see if this helps.

Does anyone know if what I have done is legit or if it can have unintended consequences?
I noticed that none of you have done the EEE disabling or the driver tuning. Is this something that I should perhaps remove?
 
I usually do remove the power management stuff on Windows, because of "issues" with devices going to sleep and then not communicating, but in this case I am not sure it's the culprit, as all references point to the offloading problem.
At least on my end, that solved the stuck network adapter issue.