ConnectX-4 Lx dual port with virtual functions - iommu group instability between reboots

Mar 25, 2026
Hi,

I've got a PVE cluster of single-socket Supermicro H13SSL-N servers, each with an EPYC 9254 CPU, 2x dual-port Intel XL710 NICs, and 1x dual-port ConnectX-4 Lx NIC.

I'm trying to use resource mappings to pass in the ConnectX-4 NIC's virtual functions.

The problem I'm having is that every time I reboot, the IOMMU groups for the ConnectX-4 card's virtual functions change. This causes errors in the resource mappings: VMs won't power on until I redefine the mapping so that the IOMMU group gets updated.

The error message appears under Datacenter / Resource Mappings.
The Status field for the mapping on a host that's been rebooted says: "Configuration for iommu group is not correct ('97' != '59')"
The actual group numbers vary from one reboot to the next.
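For anyone wanting to check what the kernel assigned on the current boot, the group can be read from the `iommu_group` symlink in sysfs. A minimal sketch; the function name `iommu_group_of` and the optional second argument (an alternative sysfs root, only there for testing) are my own and not part of any tool:

```shell
# Print the IOMMU group number currently assigned to one PCI function.
# $1 = full PCI address, $2 = optional sysfs root (hypothetical, for testing only).
iommu_group_of() {
    link="${2:-/sys/bus/pci/devices}/$1/iommu_group"
    # Functions without an IOMMU group have no such symlink.
    [ -e "$link" ] || { echo "no IOMMU group for $1" >&2; return 1; }
    # The symlink target ends in the group number, e.g. .../iommu_groups/59
    basename "$(readlink "$link")"
}
```

Something like `iommu_group_of 0000:41:00.2` could then be compared with the number stored in the mapping. The address here is only a placeholder for one of the VFs.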

I'm using the default in-kernel NIC driver (trying to avoid the pain of the OFED driver).
I've used mstconfig to configure the virtual functions:

mstconfig -d 0000:41:00.0 -y set SRIOV_EN=True NUM_OF_VFS=64

This works fine.
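In case it helps someone reproduce this: the firmware setting can be cross-checked against what the kernel exposes through the standard `sriov_totalvfs` and `sriov_numvfs` sysfs attributes. A rough sketch; `vf_counts` and its optional second argument (an alternative sysfs root, for testing) are names made up for this example:

```shell
# Print how many VFs a physical function supports and how many are enabled.
# $1 = PCI address of the physical function, $2 = optional sysfs root (hypothetical).
vf_counts() {
    dev="${2:-/sys/bus/pci/devices}/$1"
    # sriov_totalvfs: maximum the device advertises; sriov_numvfs: currently enabled.
    printf 'total=%s enabled=%s\n' \
        "$(cat "$dev/sriov_totalvfs")" "$(cat "$dev/sriov_numvfs")"
}
```

Called as `vf_counts 0000:41:00.0`, this prints both counters for the first port.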

The PCI IDs for each virtual function look to be stable between boots; it's just the IOMMU groups that seem to change.
My guess is that the order in which the two ports (and therefore their VFs) enumerate during boot is non-deterministic, and that this causes the variability in IOMMU group numbering.
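One way to test that guess would be to save the address-to-group map on each boot and compare the copies. A sketch under assumptions: the function names and the file paths in the usage note are hypothetical, and the optional argument to `dump_iommu_map` (an alternative sysfs root) exists only for testing:

```shell
# Write "address group" for every PCI function that has an IOMMU group.
# $1 = optional sysfs root (hypothetical, for testing only).
dump_iommu_map() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip functions that have no iommu_group symlink.
        [ -e "$dev/iommu_group" ] || continue
        printf '%s %s\n' "${dev##*/}" "$(basename "$(readlink "$dev/iommu_group")")"
    done
}

# Print "address old -> new" for each function whose group differs
# between two saved maps ($1 = earlier map, $2 = later map).
compare_iommu_maps() {
    awk 'NR == FNR { old[$1] = $2; next }
         ($1 in old) && old[$1] != $2 { printf "%s %s -> %s\n", $1, old[$1], $2 }' "$1" "$2"
}
```

For example: run `dump_iommu_map > /root/iommu-before.txt`, reboot, run `dump_iommu_map > /root/iommu-after.txt`, then `compare_iommu_maps /root/iommu-before.txt /root/iommu-after.txt`. If the PCI addresses really are stable, only the group column should differ.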

I've tried changing a bunch of BIOS settings related to PCI device passthrough etc., but it hasn't changed the behaviour.

Device passthrough works perfectly fine if I use the PCI IDs directly, including for the Intel XL710s that we're passing through to some production VMs.

Thanks!
 
@shaun_so - I have also seen my device tree enumerate differently between reboots when using SR-IOV on dual-port NICs. I've seen this after upgrading through the last three kernel versions; I am now at 7.0.2-6-pve. For me it's on an Intel E810-XXV 25Gb NIC (using the ice and iavf drivers). I have made no other hardware changes between these kernel upgrades or reboots that would explain a different device tree, and the number of VFs has stayed exactly the same. Just like you, I have the VFs configured as Resource Mappings at the Proxmox Datacenter level in order to pass them through to a Mikrotik CHR VM, and I see the same error as you.

With the most recent kernel cited above, after several reboots, the device tree seems to remain stable (for now).

I'd suggest trying the most recent kernel available as of this post's date, followed by a few reboots, to see whether that makes a difference.
 