Recurrent PVE node crashes after enabling IOMMU on Supermicro AMD hosts with Intel X710 (i40e)
Hello, I am looking for feedback from anyone who may have run into a similar issue.
We have two Proxmox VE nodes built on very similar Supermicro AMD platforms. Since enabling IOMMU in the BIOS, both nodes have started crashing after some uptime. The failure pattern is very similar on both systems: the nodes first start showing IOMMU-related errors, then the Intel X710 interfaces begin to misbehave, the bond loses connectivity, and the host eventually becomes isolated.
What makes this particularly confusing is that the systems appear stable when IOMMU is disabled, but in that case we do not get access to the full CPU core count we expect from the platform. So at the moment we are stuck between stability with reduced resources and full resources with recurring crashes.
I would be very interested to know whether anyone here has seen something similar with AMD hosts + IOMMU + Intel X710 / i40e, or has any suggestions on where to look next.
Environment
- Proxmox VE: 9.1
- Kernel: 6.17.13-1-pve
- Platform: Supermicro AMD-based servers
- CPU: AMD EPYC 9654 96-Core Processor (2 sockets)
- NIC: Intel X710 4-port 10GbE
- Driver: i40e
- Firmware / BIOS: fully up to date
- Both nodes were also reinstalled from scratch
- We recently enabled IOMMU in the Supermicro BIOS in order to expose/use the full CPU core set (384), and since then the nodes have started crashing after some uptime
- When IOMMU is disabled in the BIOS, the nodes remain stable, but we only see 224 of the 384 cores
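For reference, this is roughly how we check whether the IOMMU is actually active and how many logical CPUs the kernel sees (a sketch; the exact dmesg strings vary by kernel version):

```shell
# Count the logical CPUs the kernel exposes
# (in our case: 384 expected with IOMMU on, 224 with it off)
nproc

# Confirm AMD-Vi / IOMMU initialization in the kernel log
dmesg | grep -iE 'AMD-Vi|iommu'

# IOMMU groups are only populated when the IOMMU is enabled
ls /sys/kernel/iommu_groups | wc -l
```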
What we see in logs
On both nodes, the pattern is very similar:
- The first signal is AMD-Vi / IOMMU IO_PAGE_FAULT
- Then the Intel X710 / i40e driver starts showing errors like:
- LIBIE_AQ_RC_ENOSPC
- promiscuous mode forced on
- tx_timeout
- capability discovery failed ... -EIO
- Then one of the production interfaces hits:
- NETDEV WATCHDOG
- Then the bond loses slaves / connectivity
- Then the node becomes isolated and all higher-level services start failing
Why we suspect X710 / i40e / IOMMU interaction
What makes this suspicious is:
- the same fault pattern happens on two separate nodes
- the earliest meaningful errors are AMD-Vi IO_PAGE_FAULT events
- the later outage affects multiple PCI functions of the same X710 controller
- reboot restores the card/node temporarily
- this does not look like a pure switch/LACP issue
Examples of the recurring symptoms
We repeatedly see combinations like:
Code:
AMD-Vi: Event logged [IO_PAGE_FAULT ...]
AMD-Vi: IOMMU Event log restarting
i40e ... LIBIE_AQ_RC_ENOSPC
i40e ... promiscuous mode forced on
i40e ... NETDEV WATCHDOG
i40e ... capability discovery failed, err -EIO
bond0: now running without any active interface!
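To pull that sequence out of the logs after a crash, we use a filter along these lines (a sketch; `-b -1` selects the previous boot, adjust as needed):

```shell
# Kernel messages from the previous boot, filtered for the fault chain:
# IOMMU faults, i40e errors, watchdog resets, bond state changes
journalctl -k -b -1 --no-pager \
  | grep -E 'AMD-Vi|IO_PAGE_FAULT|i40e|NETDEV WATCHDOG|bond0'
```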
What we already did
- verified BIOS / firmware are current
- reinstalled the nodes from scratch
- confirmed the issue is reproducible across more than one incident
- confirmed the failures involve the same X710 controller family on both nodes
Questions:
Has anyone seen similar instability with:
- Intel X710
- i40e
- AMD hosts
- IOMMU / AMD-Vi enabled
- Have you seen IO_PAGE_FAULT events on X710 followed by NETDEV WATCHDOG / bond collapse?
- Did changing IOMMU mode help (iommu=pt, disabling IOMMU, etc.)?
- Did a different kernel or a newer/older i40e driver make this stable?
- Is this a known issue with X710 under AMD IOMMU-translated DMA mode?
- Did anyone end up solving this only by moving traffic off X710 or replacing the adapter family?
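For clarity, by iommu=pt we mean keeping the IOMMU enabled but putting host-owned devices in passthrough (identity-mapped) mode via the kernel command line. A sketch of what we would try, assuming a GRUB-booted Proxmox node (systemd-boot installs use /etc/kernel/cmdline instead):

```shell
# /etc/default/grub -- append passthrough mode to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

# Apply the new command line and reboot
update-grub
reboot
```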
Any feedback from people running similar hardware would be very helpful.
Thanks.