The Proxmox 6.8.12-9-pve kernel has introduced a problem with the e1000e driver: the network connection is lost after some hours

I have similar problems: my Proxmox host occasionally drops off the network. It had been stable for over a year and this only started happening now, so I guess it is related to an upgrade, as mentioned earlier in this thread. I have to unplug and replug the ethernet cable to get it back.

I am running the 6.8.12-11-pve kernel.

This is what I get when I run ethtool -i enp0s31f6:
Code:
driver: e1000e
version: 6.8.12-11-pve
firmware-version: 2.3-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
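
In case it helps anyone compare hardware, the exact controller model and revision can be read with lspci (the grep below is just one way to narrow the output):
Code:
lspci -nnk | grep -iA3 ethernet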

It happened again just now, and the logs report a detected hardware unit hang:
dmesg | tail -100

Code:
MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292707.471193] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116dd941>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292709.455165] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116de101>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292711.439132] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116de8c1>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292713.486122] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116df0c0>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292715.470168] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116df880>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292717.454083] e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
                  TDH                  <a1>
                  TDT                  <a6>
                  next_to_use          <a6>
                  next_to_clean        <a0>
                buffer_info[next_to_clean]:
                  time_stamp           <111694031>
                  next_to_watch        <a1>
                  jiffies              <1116e0040>
                  next_to_watch.status <0>
                MAC Status             <40080083>
                PHY Status             <796d>
                PHY 1000BASE-T Status  <3800>
                PHY Extended Status    <3000>
                PCI Status             <10>
[292718.471934] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
[292718.558360] vmbr0: port 1(enp0s31f6) entered disabled state
[292726.136923] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[292726.136966] vmbr0: port 1(enp0s31f6) entered blocking state
[292726.136974] vmbr0: port 1(enp0s31f6) entered forwarding state


journalctl --since "10 minutes ago" --no-pager | grep -Ei 'network|link|enp0s31f6|vmbr0|e1000e'

Code:
Jun 05 20:31:37 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:31:39 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
...
Jun 05 20:36:45 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:36:47 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: Detected Hardware Unit Hang:
Jun 05 20:36:48 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
Jun 05 20:36:48 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered disabled state
Jun 05 20:36:56 pve-acer-veriton kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Jun 05 20:36:56 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered blocking state
Jun 05 20:36:56 pve-acer-veriton kernel: vmbr0: port 1(enp0s31f6) entered forwarding state
Jun 05 20:37:43 pve-acer-veriton systemd[1252867]: Listening on dirmngr.socket - GnuPG network certificate management daemon.

ip -s link show enp0s31f6

Code:
2: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether d4:61:37:01:c8:33 brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast         
     35711121667  38257968      0    9749    2413 1347507
    TX:    bytes   packets errors dropped carrier collsns         
    221063815294 157587082      0       0       0       0
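
For reference, ethtool can also dump the driver's internal counters, which may show where the dropped/missed packets above are being counted (the counter names vary by driver):
Code:
ethtool -S enp0s31f6 | grep -Ei 'err|drop|miss'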

After consulting ChatGPT, I did the following:

Disabled Energy Efficient Ethernet (EEE)
EEE can apparently cause link flapping or power-saving quirks.

Created a /etc/systemd/system/disable-eee.service file.

Code:
[Unit]
Description=Disable EEE on enp0s31f6
After=network.target

[Service]
ExecStart=/sbin/ethtool --set-eee enp0s31f6 eee off
Type=oneshot
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Activated it with:
Code:
systemctl daemon-reload
systemctl enable --now disable-eee.service
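
To check whether the service actually took effect, the current EEE state can be read back with ethtool (assuming the driver reports it):
Code:
ethtool --show-eee enp0s31f6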

Tuned e1000e driver settings
Created /etc/modprobe.d/e1000e.conf and filled it with:
Code:
options e1000e InterruptThrottleRate=0,0 RxIntDelay=0 TxIntDelay=0
options e1000e enable_eee=0

Applied those changes:
Code:
update-initramfs -u -k all
reboot
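
One way to sanity-check this after the reboot is to confirm that modprobe picked up the options and that the driver did not reject any of them (an unknown option would show up as an "unknown parameter" message in the kernel log):
Code:
modprobe -c | grep "options e1000e"
dmesg | grep -i e1000e | head -20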

That seemed to work: my host was stable for two days, but today it acted up again. I found this post and applied the ethtool fix suggested here, putting it in /etc/network/interfaces like so:

Code:
iface vmbr0 inet static
    address 192.168.X.X/24
    gateway 192.168.X.X
    bridge-ports enp0s31f6
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
    post-up ethtool -K enp0s31f6 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off
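
Whether those post-up settings were actually applied can be checked after an ifreload or reboot by querying the offload state (only a few of the relevant features are grepped here):
Code:
ethtool -k enp0s31f6 | grep -E 'segmentation-offload|generic-receive-offload|scatter-gather|checksumming|vlan-offload'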

I guess all I can do now is wait and see if this helps.

Does anyone know whether what I have done is legitimate, or whether it can have unintended consequences?
I noticed that none of you have done the EEE disabling or the driver tuning. Is this something I should perhaps remove?
Thank you! This is basically what I went through as well. Unfortunately, it did not help. On top of that, I had added my USB NICs to create a bond (active-backup), but _even then_ it didn't work, since the first NIC was alive but useless, so the second NIC never came into play. The only way I could get things going was round-robin, and with that I get lots of packet loss and latency. Unplugging the first NIC 'fixed' it until I could reboot.
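
For context on the bond mentioned above, an active-backup bond in /etc/network/interfaces would look roughly like this (interface names and addresses are placeholders, not my real config). Note that bond-miimon only watches the carrier, which is why a hung-but-"up" primary NIC never triggers a failover:
Code:
# placeholder interface names; the enx... slave stands for the USB NIC
auto bond0
iface bond0 inet manual
    bond-slaves enp0s31f6 enx001122334455
    bond-mode active-backup
    bond-primary enp0s31f6
    bond-miimon 100

auto vmbr0
iface vmbr0 inet static
    address 192.168.X.X/24
    gateway 192.168.X.X
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0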

Some are mentioning the I219 (rev 21) as the problem, but I'm running rev. 10 and cannot make things work. I think I can afford to downgrade my kernel to .8 (as some are suggesting) now that I have a backup NIC.
```
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (7) I219-V [8086:15bc] (rev 10)
DeviceName: Onboard - Ethernet
Subsystem: Lenovo Ethernet Connection (7) I219-V [17aa:312a]
Kernel driver in use: e1000e
Kernel modules: e1000e
```
 
Is there any progress on this topic? I just updated Proxmox because I assumed this would be fixed by now, given how much time has passed... and directly after a reboot the network was gone.
I thought I had bricked my device. After simply disconnecting and reconnecting the ethernet cable, the device showed up again. So I guess we still have the issue of the kernel not supporting this NIC correctly, right?
 
Well, it would be great if somebody from the Proxmox team would take the time to look into this...
Yes, please. We need your help on this. Yesterday I updated my machine, and directly after a reboot it was lost on the network. Cable out, cable in -> there again. The next morning the device was lost on the network again. :(
Please offer an option for those of us hit by these buggy NIC/kernel combinations :)
 
My advice would be to roll back or to get a different NIC. This is a kernel driver bug that goes back years and has now resurfaced.
 
My advice would be to roll back or to get a different NIC. This is a kernel driver bug that goes back years and has now resurfaced.
I'd guess that a lot of home labs are running on Intel NUC or some Lenovo/Fujitsu/Dell SFF PC, which are all rocking internal NICs. So no option there to change the NIC, as these systems do not offer a PCI slot to add a different NIC. Hence it would be great if the Team at Proxmox would look into the matter and maybe provide a workaround.
 
I'd guess that a lot of home labs are running on Intel NUC or some Lenovo/Fujitsu/Dell SFF PC, which are all rocking internal NICs. So no option there to change the NIC, as these systems do not offer a PCI slot to add a different NIC. Hence it would be great if the Team at Proxmox would look into the matter and maybe provide a workaround.
Please open a ticket.
 
Unfortunately I am unable to conduct tests with the team, then. Two of my systems are in my house and are working fine with their I219-V (rev 31) and I219-V (rev 21) NICs. The affected one is at a remote site, so I am unable to run any tests there, as I would lock myself out if anything went sideways. That system has an Intel I217-LM (rev 04), which IS affected. So for the time being I am running it with a pinned .8 kernel, hoping that someone will be able to open a ticket AND have the affected system on site for tests.
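
For anyone else who wants to stay on the working kernel, pinning it on a Proxmox host can be done with proxmox-boot-tool (the 6.8.12-8-pve version string below is only an example; check the list output for what is actually installed):
Code:
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.8.12-8-pve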
 
I received the answer:

Hello,
You don't need a subscription[0] to open a bug-report in our bugzilla: https://bugzilla.proxmox.com.
That being said - this particular issue - that some Intel NICs tend to run into unit hangs with some kernel versions - is known to some extent[1].

Usually installing the latest firmware for the NIC, if available, or disabling offloading resolves the issue.

Did you try these mitigations? If they don't help, you can also try running the 6.14 opt-in kernel:

https://forum.proxmox.com/threads/o...e-8-available-on-test-no-subscription.164497/

Sadly the issue has been around - and every single fix sent to the kernel mailing list and applied in a kernel version usually causes some other Intel NICs to have similar issues - so there is no simple fix that fixes all Intel NICs, with all firmware versions, provided by all hardware vendors.

If none of the suggestions help - and the issue is not in the list in [1] - feel free to open a new bugzilla entry and provide the journal since booting/dmesg and pveversion -v outputs that show the exact issue.


I hope this helps!


stoiko
 
After many attempts besides changing the kernel version, I switched to a USB 3.0 gigabit (ASIX AX88179) adapter and have had no issues since.
Maybe I will try the onboard NIC again in a few updates if a fix comes.
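
If anyone wants to verify which driver their USB adapter is using, ethtool shows it (the interface name below is a placeholder; the AX88179 normally binds to the ax88179_178a module):
Code:
ethtool -i enx001122334455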