ICE driver failure

Jan 29, 2025
Hey,
I am fairly new to Proxmox VE and there seems to be an issue with networking on my end.

I am not 100% sure what is causing this, but at least once a week I have seen the following failure occur:

Code:
Jan 29 11:43:50 node02 kernel: ice 0000:22:00.1 irdma1: ICE OICR event notification: oicr = 0x04000003
Jan 29 11:43:50 node02 kernel: ice 0000:22:00.1 irdma1: HMC Error
Jan 29 11:43:50 node02 kernel: ice 0000:22:00.1 irdma1: Requesting a reset
Jan 29 11:43:51 node02 kernel: DMAR: DRHD: handling fault status reg 2
Jan 29 11:43:51 node02 kernel: DMAR: [DMA Read NO_PASID] Request device [22:00.1] fault addr 0xd61b7000 [fault reason 0x71] SM: Present bit in first-level paging entry is clear
Jan 29 11:43:51 node02 kernel: bond1: (slave eno12409np1): link status definitely down, disabling slave
Jan 29 11:43:51 node02 kernel: bond1: (slave eno12429np3): making interface the new active one
Jan 29 11:43:51 node02 kernel: ice 0000:22:00.1 eno12409np1: left promiscuous mode
Jan 29 11:43:51 node02 kernel: ice 0000:22:00.1 eno12409np1: left allmulticast mode
Jan 29 11:43:51 node02 kernel: ice 0000:22:00.3 eno12429np3: entered promiscuous mode
Jan 29 11:43:51 node02 kernel: ice 0000:22:00.3 eno12429np3: entered allmulticast mode
Jan 29 11:43:52 node02 kernel: ice 0000:22:00.1: PTP reset successful
Jan 29 11:43:55 node02 kernel: ice 0000:22:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
Jan 29 11:43:55 node02 kernel: ice 0000:22:00.1: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
Jan 29 11:43:55 node02 kernel: bond1: (slave eno12409np1): link status definitely up, 10000 Mbps full duplex
Jan 29 11:43:55 node02 kernel: bond1: (slave eno12409np1): making interface the new active one
Jan 29 11:43:55 node02 kernel: ice 0000:22:00.3 eno12429np3: left promiscuous mode
Jan 29 11:43:55 node02 kernel: ice 0000:22:00.3 eno12429np3: left allmulticast mode
Jan 29 11:43:55 node02 kernel: ice 0000:22:00.1 eno12409np1: entered promiscuous mode
Jan 29 11:43:55 node02 kernel: ice 0000:22:00.1 eno12409np1: entered allmulticast mode
Jan 29 11:45:54 node02 pmxcfs[1746]: [status] notice: received log


This was another crash, on a different node:


Code:
Jan 29 06:43:36 node03 kernel: ice 0000:22:00.3 irdma3: ICE OICR event notification: oicr = 0x04000003
Jan 29 06:43:36 node03 kernel: ice 0000:22:00.3 irdma3: HMC Error
Jan 29 06:43:36 node03 kernel: ice 0000:22:00.3 irdma3: Requesting a reset
Jan 29 06:43:37 node03 kernel: bond1: (slave eno12429np3): link status definitely down, disabling slave
Jan 29 06:43:37 node03 kernel: bond1: (slave eno12409np1): making interface the new active one
Jan 29 06:43:37 node03 kernel: ice 0000:22:00.3 eno12429np3: left promiscuous mode
Jan 29 06:43:37 node03 kernel: ice 0000:22:00.3 eno12429np3: left allmulticast mode
Jan 29 06:43:37 node03 kernel: ice 0000:22:00.1 eno12409np1: entered promiscuous mode
Jan 29 06:43:37 node03 kernel: ice 0000:22:00.1 eno12409np1: entered allmulticast mode
Jan 29 06:43:38 node03 kernel: ice 0000:22:00.3: PTP reset successful
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.1 irdma1: ICE OICR event notification: oicr = 0x04000003
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.1 irdma1: HMC Error
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.1 irdma1: Requesting a reset
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.3: VSI rebuilt. VSI index 0, type ICE_VSI_PF
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.3: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
Jan 29 06:43:41 node03 kernel: bond1: (slave eno12429np3): link status definitely up, 10000 Mbps full duplex
Jan 29 06:43:41 node03 kernel: bond1: (slave eno12429np3): making interface the new active one
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.1 eno12409np1: left promiscuous mode
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.1 eno12409np1: left allmulticast mode
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.3 eno12429np3: entered promiscuous mode
Jan 29 06:43:41 node03 kernel: ice 0000:22:00.3 eno12429np3: entered allmulticast mode
Jan 29 06:43:43 node03 kernel: bond1: (slave eno12409np1): link status definitely down, disabling slave
Jan 29 06:43:43 node03 kernel: ice 0000:22:00.1: PTP reset successful
Jan 29 06:43:45 node03 kernel: ice 0000:22:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
Jan 29 06:43:45 node03 kernel: ice 0000:22:00.1: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
Jan 29 06:43:45 node03 kernel: bond1: (slave eno12409np1): link status definitely up, 10000 Mbps full duplex


I have a bond set up so that if there is a failure, networking should move over to the second NIC (and the logs say it does), but all VMs on this node just drop networking. If I migrate them to a different node, the VMs work again.

Here are some more details on the setup:
3-node cluster - Dell R760
4 NICs per node - Intel(R) Ethernet 25G 4P E810-XXV OCP
Bond0 - Management - NIC0 and NIC2
Bond2 - VM Network - NIC1 and NIC3 (rough config sketch below)
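
For reference, here is a minimal sketch of roughly how the VM-network bond and bridge are wired up in /etc/network/interfaces, assuming active-backup bonding and using the interface names from the logs above; the bond/bridge names and the exact options are approximate, not copied from my actual config:

Code:
# /etc/network/interfaces (sketch - interface names taken from the logs above,
# bond name, bridge name and options are approximate, not the exact config)
auto eno12409np1
iface eno12409np1 inet manual

auto eno12429np3
iface eno12429np3 inet manual

auto bond1
iface bond1 inet manual
        bond-slaves eno12409np1 eno12429np3
        bond-miimon 100
        bond-mode active-backup

auto vmbr1
iface vmbr1 inet manual
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0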

When the ice driver fails, the management network is fine; only the VM network drops.

I am on the latest Proxmox VE 8.3 (patched just today and still saw the issue).
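
In case it is useful, this is how I check the ice driver and firmware versions on one of the affected ports and look for repeat occurrences of the error (generic checks, interface name taken from the logs above):

Code:
# Show ice driver version and NVM/firmware version for an affected port
ethtool -i eno12409np1

# Check whether the HMC error / OICR reset has happened again since boot
journalctl -k | grep -E "HMC Error|OICR"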
 