Network crash after 3 or 4 hours

arnaudp · Feb 7, 2023

Hi,

We have an issue with our dual 25Gb network card, which "crashes" after 3 or 4 hours:

error report:
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: ICE OICR event notification: oicr = 0x04000003
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: HMC Error
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: Requesting a reset
Feb 07 11:27:19 pve-03 kernel: ice 0000:98:00.0: Removed PTP clock
Feb 07 11:27:19 pve-03 kernel: ice 0000:98:00.0: Clearing default VSI, re-enable after reset completes
Feb 07 11:27:30 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered disabled state
Feb 07 11:27:30 pve-03 kernel: ice 0000:98:00.0: PTP init successful
Feb 07 11:27:32 pve-03 pvestatd[2632]: Backup: error fetching datastores - 500 Can't connect to 172.16.110.233:8007 (Connection timed out)
Feb 07 11:27:32 pve-03 pvestatd[2632]: status update time (14.178 seconds)
Feb 07 11:27:35 pve-03 kernel: ice 0000:98:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
Feb 07 11:27:35 pve-03 kernel: ice 0000:98:00.0: VSI rebuilt. VSI index 383, type ICE_VSI_CTRL
Feb 07 11:27:37 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered blocking state
Feb 07 11:27:37 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered forwarding state

enp152s0f0 is VLAN-aware and only configured via VLANs on vmbr0. No bonding, and a default Linux port.

After this error nothing seems to be broken, but all VMs on the node lose their network connection; rebooting them does not help.
The only way to recover is to restart the Proxmox server.
Restarting the networking service seems to reboot the computer.
The Proxmox version is the latest, 7.3.4, with kernel 5.15.83-1-pve.

Does anyone have this type of error?
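
For anyone wanting to check whether a node has hit the same reset, here is a minimal sketch that scans kernel log lines for the `HMC Error` / `Requesting a reset` signature shown in the excerpt above. The regular expression and the function name `find_hmc_resets` are assumptions made for illustration, not part of the `ice` driver or any Proxmox tool.

```python
import re

# Pattern derived from the log lines quoted in this post (an assumption,
# not an official format): "ice <pci address> irdmaN: <event>".
HMC_RESET = re.compile(
    r"ice (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<rdma>irdma\d+): (?P<event>HMC Error|Requesting a reset)"
)


def find_hmc_resets(lines):
    """Return (pci_address, rdma_device, event) for each matching log line."""
    hits = []
    for line in lines:
        match = HMC_RESET.search(line)
        if match:
            hits.append(
                (match.group("pci"), match.group("rdma"), match.group("event"))
            )
    return hits
```

The function accepts any iterable of strings, so it can be fed the lines of a saved `journalctl -k` or `dmesg` dump.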

Moayad · Feb 7, 2023

Hello,

Have you checked if that NIC has a new update of the firmware? If it is up-to-date, I would try an older pve-kernel to narrow down the case.

arnaudp · Feb 7, 2023

Not yet. I'm going to check if there is an update,
and then roll back to the original 7.3 kernel, 5.15.74, to test.
Thanks.

arnaudp · Feb 10, 2023

I tried switching to the opt-in kernel 6.1.6 and this fixed the issue. I'm still working with NIC support to check whether there is a fix with a new firmware on the 5.15 kernel.

fst · Feb 11, 2023

The same thing happened to me three times today. Now my site does not work at all anymore, and I am in a foreign country and cannot even reboot. Have you figured out a kernel version that works? This is a worst-case scenario.

arnaudp · Feb 12, 2023

As of today, kernel 6.1.6 has been stable for me since Tuesday, so that's the solution I've found.
I'm waiting for news from our provider about any firmware update or kernel issue.

fst · Mar 16, 2023

Thanks. It worked for me. Here is more feedback about 6.2.x. Do you have the same symptoms?
https://forum.proxmox.com/threads/opt-in-linux-6-2-kernel-for-proxmox-ve-7-x-available.124189/

Ivanhome · Apr 10, 2023

Try changing the DNS to 8.8.8.8; I fixed it that way.

Jorge Peixoto · Jun 1, 2023

arnaudp said:
Does anyone have this type of error?

Me.

Check this out:

My scenario:
HPE ProLiant DL380 Gen10 Plus
Two Ethernet adapters configured as an LACP OVS bond: Intel(R) Eth E810-XXVDA2 and Intel(R) Eth Ntwk Adptr OCP3.0 E810-XXVDA2. Both are running firmware version 3.20. Both have successfully loaded DDP package ICE COMMS Package version 1.3.40.0
Proxmox VE 7.4 running kernel 5.15.85-1-pve or 5.19.17-2-pve

PVE runs stable if running on kernel 5.15.39-1-pve.

Upgrading the kernel to 6.2 is not an option; AFAIK it's incompatible with DRBD/LINSTOR.
https://forum.proxmox.com/threads/kernel-6-2-drbd-inkompatibel.126902/

kev1904 · Aug 21, 2023

Facing the same issue with a DL325 Gen10 Plus: network/server crash after a few days. Did you solve it? We use HP-certified 100Gbit/s QSFP adapters from Intel (E810-C).

fst · Aug 21, 2023

kev1904 said:
Facing the same issue with a DL325 Gen10 Plus: network/server crash after a few days. Did you solve it? We use HP-certified 100Gbit/s QSFP adapters from Intel (E810-C).

I am currently using Proxmox 8.0 with the 6.2 kernel, so I cannot help you with 5.15. It has not crashed so far. However, I am no longer running pfSense on that machine, so there is less network traffic.

Jorge Peixoto · Aug 21, 2023

I managed to fix it by upgrading the kernel to 6.2.

Simryc · Feb 13, 2024

I also have problems with the Intel E810-C card. I replaced the card with a new one and everything was fine for over a week, until communication broke down again today.
Kernel 6.5.11-8-pve, Proxmox 8.1.4. Has anyone managed to find the cause?
dmesg:

Code:

[132894.250139] ice 0000:41:00.0 irdma0: ICE OICR event notification: oicr = 0x04000003
[132894.250394] ice 0000:41:00.0 irdma0: HMC Error
[132894.250582] ice 0000:41:00.0 irdma0: Requesting a reset
[132895.311016] vmbr0: port 1(enp65s0f0np0) entered disabled state

My first thought was a driver incompatibility with Debian 12, but from what I have read, these cards work fine for others.

kev1904 · Feb 13, 2024

Simryc said:
I also have problems with the Intel E810-C card. I replaced the card with a new one and everything was fine for over a week, until communication broke down again today.
Kernel 6.5.11-8-pve, Proxmox 8.1.4. Has anyone managed to find the cause?
dmesg:

Code:

[132894.250139] ice 0000:41:00.0 irdma0: ICE OICR event notification: oicr = 0x04000003
[132894.250394] ice 0000:41:00.0 irdma0: HMC Error
[132894.250582] ice 0000:41:00.0 irdma0: Requesting a reset
[132895.311016] vmbr0: port 1(enp65s0f0np0) entered disabled state

My first thought was a driver incompatibility with Debian 12, but from what I have read, these cards work fine for others.

Hi Simryc,

I've tried a lot to fix this issue. It seems that using 100Gbit/s network cards is still too new. Ultimately, we replaced the Intel cards with Mellanox (NVIDIA) ConnectX-6, and they've been running without any problems for months.

Simryc · Feb 18, 2024

After a new BIOS for the motherboard and a new network card, everything seems to be working.

TedC · Oct 17, 2024

We are seeing this issue on 3 SuperMicro boxes with `ice` based AOC network cards. I've opened this issue with the Intel Driver Team to see if we can chase down the root cause:

https://github.com/intel/ethernet-linux-ice/issues/12

firth · Oct 17, 2024

TedC said:
We are seeing this issue on 3 SuperMicro boxes with `ice` based AOC network cards. I've opened this issue with the Intel Driver Team to see if we can chase down the root cause:

https://github.com/intel/ethernet-linux-ice/issues/12

Same thing here, with Intel E810-C in Supermicro SYS-221H-TN24R.

TedC · Oct 17, 2024

Can you please comment on the GitHub issue to increase the chances Intel looks at it? I haven't yet reached out to SuperMicro, but that's another approach we could take.

TedC · Nov 1, 2024

This appears to be a bug in the interaction between the Intel NIC and RDMA drivers. I have not been able to reproduce the error with the irdma and Infiniband modules unloaded from the kernel. The Intel NIC driver does try to load irdma if available, so for a long-term solution, you'd want to blacklist the module. I'm continuing my testing over the weekend, and if I don't see any ongoing issues, I plan to implement the blacklist solution across the board.
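
A minimal sketch of what such a blacklist could look like, written as a small Python helper that renders the text of a modprobe.d file. Only `irdma` is named in this post; the helper name `render_blacklist`, the file name in the comment, and any additional InfiniBand module names you pass in are assumptions to adapt to your own system.

```python
def render_blacklist(modules):
    """Build modprobe.d text that keeps the listed modules from loading."""
    lines = []
    for name in modules:
        # "blacklist" stops alias-based autoloading of the module.
        lines.append(f"blacklist {name}")
        # "install <name> /bin/false" also blocks explicit modprobe requests.
        lines.append(f"install {name} /bin/false")
    return "\n".join(lines) + "\n"


# Hypothetical target file: /etc/modprobe.d/blacklist-irdma.conf
# Only irdma is named in the post; extend the list if needed.
print(render_blacklist(["irdma"]), end="")
```

On a Debian-based system such as Proxmox VE, the initramfs usually has to be rebuilt (for example with `update-initramfs -u`) and the node rebooted before a change like this takes effect.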

TedC · Nov 8, 2024

Over the past week, I have confirmed that preventing the loading of the RDMA and IB modules prevents this error.
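
To confirm that the modules really are absent after a reboot, something along these lines could be used. It is a sketch under assumptions: the prefixes `irdma`, `ib_` and `rdma_` are a guess at which module names count as "RDMA and IB" here, and `loaded_rdma_modules` is a hypothetical helper, not an existing tool.

```python
def loaded_rdma_modules(proc_modules_text, prefixes=("irdma", "ib_", "rdma_")):
    """Return the sorted names of loaded modules that match the given prefixes.

    proc_modules_text is the content of /proc/modules, where the first
    whitespace-separated field of every line is the module name.
    """
    names = [
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    ]
    return sorted(name for name in names if name.startswith(prefixes))


# Example use on a live system (not executed here):
#   with open("/proc/modules") as fh:
#       print(loaded_rdma_modules(fh.read()))
```

An empty list means none of the matching modules are currently loaded.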
