Recurrent PVE node crashes after enabling IOMMU on Supermicro AMD hosts with Intel X710 (i40e)
Hello, I am looking for feedback from anyone who may have run into a similar issue.
We have two Proxmox VE nodes built on very similar Supermicro AMD platforms. Since enabling IOMMU in the BIOS, both nodes have started crashing after some uptime. The failure pattern is very similar on both systems: the nodes first start showing IOMMU-related errors, then the Intel X710 interfaces begin to misbehave, the bond loses connectivity, and the host eventually becomes isolated.
What makes this particularly confusing is that the systems appear stable when IOMMU is disabled, but in that case we do not get access to the full CPU core count we expect from the platform. So at the moment we are stuck between stability with reduced resources and full resources with recurring crashes.
I would be very interested to know whether anyone here has seen something similar with AMD hosts + IOMMU + Intel X710 / i40e, or has any suggestions on where to look next.
Environment
- Proxmox VE: 9.1
- Kernel: 6.17.13-1-pve
- Platform: Supermicro AMD-based servers
- CPU: AMD EPYC 9654 96-Core Processor (2 sockets)
- NIC: Intel X710 4-port 10GbE
- Driver: i40e
- Firmware / BIOS: fully up to date
- Both nodes were also reinstalled from scratch
- We recently enabled IOMMU in the Supermicro BIOS in order to expose/use the full CPU core set (384), and since then the nodes have started crashing after some uptime
- When IOMMU is disabled in the BIOS, the nodes remain stable, but we only see 224 of the 384 cores
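For reference, this is roughly how we check whether the IOMMU is actually active and how many logical CPUs the kernel sees (a sketch; the exact dmesg strings vary by kernel version):

```shell
# Count the logical CPUs the kernel exposes
# (in our case: 384 expected with IOMMU on, 224 with it off)
nproc

# Confirm AMD-Vi / IOMMU initialization in the kernel log
dmesg | grep -iE 'AMD-Vi|iommu'

# IOMMU groups are only populated when the IOMMU is enabled
ls /sys/kernel/iommu_groups | wc -l
```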
What we see in logs
On both nodes, the pattern is very similar:
- The first signal is AMD-Vi / IOMMU IO_PAGE_FAULT
- Then the Intel X710 / i40e driver starts showing errors like:
- LIBIE_AQ_RC_ENOSPC
- promiscuous mode forced on
- tx_timeout
- capability discovery failed ... -EIO
- Then one of the production interfaces hits:
- NETDEV WATCHDOG
- Then the bond loses slaves / connectivity
- Then the node becomes isolated and all higher-level services start failing
Why we suspect X710 / i40e / IOMMU interaction
What makes this suspicious is:
- the same fault pattern happens on two separate nodes
- the earliest meaningful errors are AMD-Vi IO_PAGE_FAULT events
- the later outage affects multiple PCI functions of the same X710 controller
- reboot restores the card/node temporarily
- this does not look like a pure switch/LACP issue
Examples of the recurring symptoms
We repeatedly see combinations like:
Code:
AMD-Vi: Event logged [IO_PAGE_FAULT ...]
AMD-Vi: IOMMU Event log restarting
i40e ... LIBIE_AQ_RC_ENOSPC
i40e ... promiscuous mode forced on
i40e ... NETDEV WATCHDOG
i40e ... capability discovery failed, err -EIO
bond0: now running without any active interface!
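To pull that sequence out of the logs after a crash, we use a filter along these lines (a sketch; `-b -1` selects the previous boot, adjust as needed):

```shell
# Kernel messages from the previous boot, filtered for the fault chain:
# IOMMU faults, i40e errors, watchdog resets, bond state changes
journalctl -k -b -1 --no-pager \
  | grep -E 'AMD-Vi|IO_PAGE_FAULT|i40e|NETDEV WATCHDOG|bond0'
```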
What we already did
- verified BIOS / firmware are current
- reinstalled the nodes from scratch
- confirmed the issue is reproducible across more than one incident
- confirmed the failures involve the same X710 controller family on both nodes
Questions:
Has anyone seen similar instability with:
- Intel X710
- i40e
- AMD hosts
- IOMMU / AMD-Vi enabled
- Have you seen IO_PAGE_FAULT events on X710 followed by NETDEV WATCHDOG / bond collapse?
- Did changing IOMMU mode help (iommu=pt, disabling IOMMU, etc.)?
- Did a different kernel or a newer/older i40e driver make this stable?
- Is this a known issue with X710 under AMD IOMMU-translated DMA mode?
- Did anyone end up solving this only by moving traffic off X710 or replacing the adapter family?
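For clarity, by iommu=pt we mean keeping the IOMMU enabled but putting host-owned devices in passthrough (identity-mapped) mode via the kernel command line. A sketch of what we would try, assuming a GRUB-booted Proxmox node (systemd-boot installs use /etc/kernel/cmdline instead):

```shell
# /etc/default/grub -- append passthrough mode to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

# Apply the new command line and reboot
update-grub
reboot
```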
Any feedback from people running similar hardware would be very helpful.
Thanks.