Proxmox 6.8.12-9-pve kernel has introduced a problem with e1000e Driver and network connection lost after some hours

I have a similar problem: my Proxmox host occasionally drops off the network. It had been stable for over a year and only started doing this recently, so I suspect it is related to an upgrade, as mentioned earlier in this thread. I have to unplug and replug the Ethernet cable to get it back.

I am running the 6.8.12-11-pve kernel.

This is what I find if I run ethtool -i enp0s31f6
Code:
driver: e1000e
version: 6.8.12-11-pve
firmware-version: 2.3-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

It just happened again, and the logs report a Hardware Unit Hang:
dmesg | tail -100

Code:
MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292707.471193] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116dd941>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292709.455165] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116de101>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292711.439132] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116de8c1>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292713.486122] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116df0c0>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292715.470168] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116df880>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292717.454083] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116e0040>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292718.471934] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
[292718.558360] vmbr0: port 1(enp0s31f6) entered disabled state
[292726.136923] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[292726.136966] vmbr0: port 1(enp0s31f6) entered blocking state
[292726.136974] vmbr0: port 1(enp0s31f6) entered forwarding state


journalctl --since "10 minutes ago" --no-pager | grep -Ei 'network|link|enp0s31f6|vmbr0|e1000e'

Code:
Jun 05 20:31:37 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:31:39 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
...
Jun 05 20:36:45 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:36:47 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:36:48 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
Jun 05 20:36:48 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered disabled state
Jun 05 20:36:56 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jun 05 20:36:56 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered blocking state
Jun 05 20:36:56 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered forwarding state
Jun 05 20:37:43 pve-acer-veriton systemd[1252867]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
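
To see how often the hang occurs before the link drops, the messages above can be counted per interface. This is only a rough sketch: `count_hangs` is a made-up helper name, and the match pattern is based on the log lines shown in this post.

```shell
#!/bin/sh
# Count e1000e "Detected Hardware Unit Hang" messages for one interface.
# Reads kernel log lines on stdin, so it works with journalctl or dmesg output.
count_hangs() {
    # $1: interface name, e.g. enp0s31f6 (hypothetical helper, not part of any tool)
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
    grep -c "e1000e .* $1: Detected Hardware Unit Hang" || true
}

# Usage, assuming journalctl is available as on a stock Proxmox host:
#   journalctl -k --since "1 hour ago" --no-pager | count_hangs enp0s31f6
```

A count that keeps growing between checks suggests the workaround currently applied has not stopped the hangs.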

ip -s link show enp0s31f6

Code:
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether d4:61:37:01:c8:33 brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast         
     35711121667  38257968      0    9749    2413 1347507
    TX:    bytes   packets errors dropped carrier collsns         
    221063815294 157587082      0       0       0       0

After consulting ChatGPT, I did the following:

Disabled Energy Efficient Ethernet (EEE)
EEE can apparently cause link flapping or power-saving quirks.

Created a /etc/systemd/system/disable-eee.service file.

Code:
[Unit]
Description=Disable EEE on enp0s31f6
After=network.target

[Service]
ExecStart=/sbin/ethtool --set-eee enp0s31f6 eee off
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Activate with:
Code:
systemctl daemon-reload
systemctl enable --now disable-eee.service
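
To confirm that the service actually turned EEE off, the state can be read back with `ethtool --show-eee`. A minimal sketch, assuming the output contains an `EEE status:` line; `eee_status` is a made-up helper name.

```shell
#!/bin/sh
# Print whatever follows "EEE status:" in `ethtool --show-eee` output read from stdin.
# Assumption: ethtool prints a line of the form "EEE status: <state>".
eee_status() {
    sed -n 's/^[[:space:]]*EEE status:[[:space:]]*//p'
}

# Usage on the host (interface name taken from this post):
#   ethtool --show-eee enp0s31f6 | eee_status
#   systemctl is-enabled disable-eee.service
```

If the first command still reports an enabled state after a reboot, the unit probably ran before the interface was ready or was not enabled.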

Tuned e1000e driver settings
Created /etc/modprobe.d/e1000e.conf and filled it with:
Code:
options e1000e InterruptThrottleRate=0,0 RxIntDelay=0 TxIntDelay=0
options e1000e enable_eee=0

Applied those changes:
Code:
update-initramfs -u -k all
reboot
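
One way to check whether the driver recognises every option in /etc/modprobe.d/e1000e.conf is to compare the option names against the list printed by `modinfo -p e1000e`. This is a sketch under assumptions: `conf_params` and `unknown_params` are made-up helper names, and `modinfo -p` is assumed to print one `name:description` line per parameter.

```shell
#!/bin/sh
# List the parameter names set for a module in a modprobe.d file.
conf_params() {
    # $1: module name; stdin: contents of a modprobe.d .conf file
    awk -v m="$1" '$1 == "options" && $2 == m {
        for (i = 3; i <= NF; i++) { split($i, kv, "="); print kv[1] }
    }'
}

# Print the names from stdin that are missing from a saved `modinfo -p` listing.
unknown_params() {
    # $1: file holding the output of `modinfo -p <module>`
    while read -r name; do
        grep -q "^${name}:" "$1" || echo "$name"
    done
}

# Usage on the host:
#   modinfo -p e1000e > /tmp/e1000e.params
#   conf_params e1000e < /etc/modprobe.d/e1000e.conf | unknown_params /tmp/e1000e.params
```

Any name this prints is one the loaded driver does not list as a parameter, so that option is unlikely to have any effect.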

That seemed to work: my host was stable for 2 days, but today it acted up again. I then found this thread and applied the ethtool fix suggested here, putting it in /etc/network/interfaces like this:

Code:
iface vmbr0 inet static
    address 192.168.X.X/24
    gateway 192.168.X.X
    bridge-ports enp0s31f6
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    post-up ethtool -K enp0s31f6 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
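
After bringing the bridge up, it may be worth checking that the offloads really are off, since some features can be fixed by the driver. A minimal sketch: `still_on` is a made-up helper, and the long feature names are the ones `ethtool -k` is assumed to print for the short names used in the post-up line.

```shell
#!/bin/sh
# From `ethtool -k <iface>` output on stdin, print the offload features
# targeted by the post-up line that are still reported as "on".
still_on() {
    sed -nE 's/^[[:space:]]*(rx-checksumming|tx-checksumming|scatter-gather|tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|rx-vlan-offload|tx-vlan-offload): on.*/\1/p'
}

# Usage on the host:
#   ethtool -k enp0s31f6 | still_on
```

No output means every listed offload is reported as off.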

I guess all I can do now is wait and see if this helps.

Does anyone know whether what I have done is legitimate, or whether it could have unintended consequences?
I noticed that none of you have disabled EEE or tuned the driver. Is this something I should perhaps remove?
Thank you! This is basically what I went through as well. Unfortunately, it did not help. On top of that, I added my USB NICs to create a bond (active-backup), but _even then_ it didn't work: the first NIC's link stayed up but was useless, so the second NIC never came into play. The only way I could get things going was to use round-robin, and with that I get lots of packet loss and latency. Unplugging the first NIC 'fixed' it until I could reboot.

Some are mentioning the I219 (21) as the problem, but I'm running rev. 10 and still cannot make things work. Now that I have a backup NIC, I think I can afford to downgrade my kernel to .8 (as some are suggesting).
```
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-V [8086:15bc] (rev 10)
DeviceName: Onboard - Ethernet
Subsystem: Lenovo Ethernet Connection (7) I219-V [17aa:312a]
Kernel driver in use: e1000e
Kernel modules: e1000e
```
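
Since people in this thread are comparing controller revisions, here is a small sketch for pulling the device ID and revision out of `lspci -nn` output. `nic_id_rev` is a made-up helper name, and the pattern assumes an Intel device (vendor 8086) in the line format shown above.

```shell
#!/bin/sh
# From `lspci -nn` output on stdin, print "<device-id> <revision>" for Intel devices.
nic_id_rev() {
    sed -n 's/.*\[8086:\([0-9a-f]*\)\] (rev \([0-9a-f]*\)).*/\1 \2/p'
}

# Usage on the host (PCI address taken from the output above):
#   lspci -nn -s 00:1f.6 | nic_id_rev
```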
 
Is there any progress on this topic? I just updated Proxmox because I thought this would have been fixed after such a long time, but directly after the reboot the network was gone.
I thought I had bricked my device. After simply disconnecting and reconnecting the Ethernet cable, the device was reachable again. So I guess we still have the issue of the kernel not supporting this NIC correctly, right?
 
Well, it would be great if somebody from the Proxmox team would take the time to look into this...
Yes please. We need your help on this. Yesterday I updated my machine. Directly after the reboot, the machine was lost on the network. Cable out, cable in, and it was there again. The next morning the device was lost on the network again. :(
Please offer an option for those buggy NIC kernels :)