Network crash after 3 or 4 hours

arnaudp

New Member
Feb 7, 2023
Hi,

We have an issue with our dual 25Gb network card, which "crashes" after 3 or 4 hours:

error report:
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: ICE OICR event notification: oicr = 0x04000003
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: HMC Error
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: Requesting a reset
Feb 07 11:27:19 pve-03 kernel: ice 0000:98:00.0: Removed PTP clock
Feb 07 11:27:19 pve-03 kernel: ice 0000:98:00.0: Clearing default VSI, re-enable after reset completes
Feb 07 11:27:30 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered disabled state
Feb 07 11:27:30 pve-03 kernel: ice 0000:98:00.0: PTP init successful
Feb 07 11:27:32 pve-03 pvestatd[2632]: Backup: error fetching datastores - 500 Can't connect to 172.16.110.233:8007 (Connection timed out)
Feb 07 11:27:32 pve-03 pvestatd[2632]: status update time (14.178 seconds)
Feb 07 11:27:35 pve-03 kernel: ice 0000:98:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
Feb 07 11:27:35 pve-03 kernel: ice 0000:98:00.0: VSI rebuilt. VSI index 383, type ICE_VSI_CTRL
Feb 07 11:27:37 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered blocking state
Feb 07 11:27:37 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered forwarding state

enp152s0f0 is VLAN-aware and is only configured via VLANs on vmbr0. No bonding, just a default Linux port.


After this error nothing on the network seems to be broken, but all VMs on the node lose their network connection, and rebooting them does not help.
The only way to recover is to restart the Proxmox server.
Restarting the networking service seems to reboot the computer.
The Proxmox version is the latest one, 7.3.4, with kernel 5.15.83-1-pve.

Does anyone else have this type of error?
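
For anyone who wants to check their own journal for the same signature, here is a minimal Python sketch. It assumes log lines in the same `Mon DD HH:MM:SS host message` layout as the excerpt above and an English locale for month names; the function names and the `year` parameter are made up for illustration (the log lines carry no year). It looks for the `HMC Error` line and measures the time until the bridge port re-enters the forwarding state.

```python
import re
from datetime import datetime

# Marker strings copied from the kernel log in this thread.
HMC_MARKER = "irdma0: HMC Error"
FORWARDING_MARKER = "entered forwarding state"

# Matches e.g. "Feb 07 11:27:17 pve-03 kernel: ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<msg>.*)$"
)


def parse_line(line, year=2023):
    """Return (timestamp, message) or None if the line has another layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["msg"]


def find_hmc_resets(lines, year=2023):
    """Collect each HMC error and the next 'entered forwarding state' after it."""
    events = []
    current = None
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        ts, msg = parsed
        if HMC_MARKER in msg:
            current = {"hmc_error": ts, "forwarding": None}
            events.append(current)
        elif (
            current is not None
            and current["forwarding"] is None
            and FORWARDING_MARKER in msg
        ):
            current["forwarding"] = ts
    return events


# A few lines from the error report above, used as sample input.
SAMPLE = """\
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: HMC Error
Feb 07 11:27:17 pve-03 kernel: ice 0000:98:00.0 irdma0: Requesting a reset
Feb 07 11:27:30 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered disabled state
Feb 07 11:27:37 pve-03 kernel: vmbr0: port 1(enp152s0f0) entered forwarding state
""".splitlines()

if __name__ == "__main__":
    for event in find_hmc_resets(SAMPLE):
        gap = event["forwarding"] - event["hmc_error"]
        print(event["hmc_error"], "link back after", gap.total_seconds(), "s")
```

Note that this only measures how long the port took to come back at the bridge level. As described above, the VMs can stay offline even after the port reports forwarding again.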
 
Hello,

Have you checked if that NIC has a new update of the firmware? If it is up-to-date, I would try an older pve-kernel to narrow down the case.
 
Not yet, I'm going to check if there is an update,
and then roll back to the original 7.3 kernel 5.15.74 to test.
Thanks
 
I switched to the opt-in kernel 6.1.6 and this fixed the issue. I'm still working with NIC support to check whether there is a fix for the 5.15 kernel in a new firmware.
 
The same thing happened to me three times today. Now my site does not work at all anymore, and I am in a foreign country and cannot even reboot. Have you figured out a kernel version that works? This is a worst-case scenario.
 
As of today, kernel 6.1.6 has been stable for me since Tuesday, so that's the solution I've found.
I'm waiting for news from our provider about any firmware update or kernel issue.
 
Me. I have the same error.

Check this out:
[Attached screenshot: 1685568988513.png]

My scenario:
HPE ProLiant DL380 Gen10 Plus
Two Ethernet adapters configured as an LACP OVS bond: Intel(R) Eth E810-XXVDA2 and Intel(R) Eth Ntwk Adptr OCP3.0 E810-XXVDA2. Both are running firmware version 3.20. Both have successfully loaded DDP package ICE COMMS Package version 1.3.40.0
Proxmox VE 7.4 running kernel 5.15.85-1-pve or 5.19.17-2-pve

PVE runs stably on kernel 5.15.39-1-pve.

Upgrading the kernel to 6.2 is not an option; AFAIK it's incompatible with DRBD/LINSTOR.
https://forum.proxmox.com/threads/kernel-6-2-drbd-inkompatibel.126902/
 
Facing the same issue with a DL325 Gen10 Plus: network/server crash after a few days. Did you solve it? We use 100Gbit/s QSFP adapters from Intel (E810-C), HP certified.
 
I am currently using Proxmox 8.0 with the 6.2 kernel, so I cannot help you with 5.15. It has not crashed so far. However, I am no longer running pfSense on that machine, so there is less network traffic.
 
I also have problems with the Intel E810-C card. I replaced the card with a new one and everything was fine for over a week, until communication broke down again today.
Kernel 6.5.11-8-pve, Proxmox 8.1.4. Has anyone managed to find the cause?
dmesg:
Code:
[132894.250139] ice 0000:41:00.0 irdma0: ICE OICR event notification: oicr = 0x04000003
[132894.250394] ice 0000:41:00.0 irdma0: HMC Error
[132894.250582] ice 0000:41:00.0 irdma0: Requesting a reset
[132895.311016] vmbr0: port 1(enp65s0f0np0) entered disabled state
My first idea was that the drivers are incompatible with Debian 12, but from what I have read, these cards work fine for others.
 
Hi Simryc,

I've tried a lot to fix this issue. It seems that using 100Gbit/s network cards is still too new. Ultimately, we replaced the Intel cards with Mellanox (NVIDIA) ConnectX-6, and they've been running without any problems for months.
 
Can you please comment on the GitHub issue to increase the chances that Intel looks at it? I haven't yet reached out to Supermicro, but that's another approach we could take.
 
This appears to be a bug in the interaction between the Intel NIC and RDMA drivers. I have not been able to reproduce the error with the irdma and Infiniband modules unloaded from the kernel. The Intel NIC driver does try to load irdma if available, so for a long-term solution, you'd want to blacklist the module. I'm continuing my testing over the weekend, and if I don't see any ongoing issues, I plan to implement the blacklist solution across the board.
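
As a rough illustration of the blacklist approach described above, here is a small Python sketch. It checks whether `irdma` is currently loaded by parsing `/proc/modules`-style text and renders a modprobe.d fragment. The helper names, the `strict` option, and the suggested file name are assumptions for illustration, not something taken from Intel or Proxmox documentation.

```python
from pathlib import Path


def irdma_loaded(proc_modules_text, module="irdma"):
    """Return True if `module` is listed in /proc/modules-style text.

    Each line of /proc/modules starts with the module name.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == module:
            return True
    return False


def blacklist_conf(module="irdma", strict=False):
    """Render a modprobe.d fragment that blacklists `module`.

    `blacklist` stops alias-based autoloading. With strict=True an
    `install` override is added so that explicit modprobe requests for
    the module fail as well.
    """
    lines = [
        f"# Do not autoload the {module} module",
        f"blacklist {module}",
    ]
    if strict:
        lines.append(f"install {module} /bin/false")
    return "\n".join(lines) + "\n"


def write_blacklist(path, module="irdma", strict=False):
    """Write the fragment to `path` (needs root for /etc/modprobe.d)."""
    target = Path(path)
    target.write_text(blacklist_conf(module, strict))
    return target


if __name__ == "__main__":
    modules_text = Path("/proc/modules").read_text()
    print("irdma currently loaded:", irdma_loaded(modules_text))
    # Print the fragment instead of writing it, so nothing changes by accident.
    print(blacklist_conf(), end="")
```

The printed fragment would typically go into a file under `/etc/modprobe.d/` (for example a hypothetical `blacklist-irdma.conf`), followed by `update-initramfs -u` and a reboot so that the module is also left out at early boot. I would test this on one node first before rolling it out everywhere.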
 
