Hi all,
you're my last hope: I have a few HP-branded Mellanox NICs (two ConnectX-2 and this ConnectX-3). I've been trying to make passthrough work for the past two weeks, to no avail. I mainly want this for pfSense, but it doesn't work at all.
The NICs are all recognized by the host OS, but as soon as I try to pass them through to a guest (any guest) I end up with kernel panics.
The system is a Ryzen 5700G with 64 GB of RAM on an ASRock "Fatal1ty B450 Gaming-ITX/ac" (this one, not the newer K4).
I made sure to go through https://pve.proxmox.com/wiki/Pci_passthrough#PCI_EXPRESS_PASSTHROUGH and check:
- that IOMMU is enabled, both in BIOS and GRUB (amd_iommu=on, tried both with and without iommu=pt)
- that all the correct modules are loaded (they're both in /etc/modules and verified loaded via lsmod)
- that IOMMU Interrupt Remapping is present:
Bash:
# dmesg | grep -i remapping
[ 0.445373] AMD-Vi: Interrupt remapping enabled
- that IOMMU Isolation is on (see the IOMMU groups at the bottom of the post)
- that the card is detected properly (it works on the host, so...)
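For anyone who wants to reproduce the checks above, this is roughly what I ran on the host — plain one-liners, nothing exotic, and each one just prints a warning instead of failing:

```shell
# Quick sanity checks for the passthrough prerequisites (run on the PVE host,
# ideally as root so dmesg is readable).
grep -o 'amd_iommu=[^ ]*\|iommu=[^ ]*' /proc/cmdline || echo "WARN: no iommu flags on kernel cmdline"
lsmod | grep -E '^vfio' || echo "WARN: vfio modules not loaded"
dmesg 2>/dev/null | grep -i 'remapping' || echo "WARN: no interrupt remapping message (or dmesg needs root)"
```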
pfSense panics identically with both ConnectX-2 and -3, saying
Code:
mlx4_core0: command 0x23 timed out (go bit not cleared)
mlx4_core0: device is going to be reset
mlx4_core0: device was reset successfully
mlx4_core0: Failed to initialize queue pair table, aborting
Fatal trap 12: page fault while in kernel mode
[rest of the stack trace]
The same happens on a live FreeBSD as soon as I load the Mellanox mlx4en kernel module. Debian VMs give similar errors that scroll by too fast for me to screenshot, and in Windows 10 the driver setup hangs while Event Viewer > Security fills up with mlx4_bus errors saying
Code:
Native_6_16_0: Lost interrupt was detected, inserting DPC to process EQE.
EQE found on EQ index: 4
Number of ETH EQs: 4
I've done an ungodly amount of Googling and trial and error on this (it doesn't help that this is my main router, so while testing I have no network and have to tether through my phone). I initially thought it was a FreeBSD/pfSense bug, which led me down the rabbit hole of updating and reconfiguring the card firmware (spoiler: that didn't work; also, I only did this on one card, the others are still "normal"). But it doesn't seem to be a card problem.
In the meantime I also checked the cards' firmware versions, and they're all as recent as they can be.
I've tried all combinations of settings in the passthrough menu (PCIe yes/no, all functions or not, ROM BAR on/off), but I still get the same error every time.
I tried clearing the CMOS, disabling various settings on the cards themselves (via the system BIOS and the card's boot-time configurator), and changing the machine type.
In the BIOS I turned Resizable BAR on and off, SR-IOV on and off, PCIe ARI on and off, disabled XMP, and probably other things I forgot.
I disabled "HP Shared Memory" for the NIC in the BIOS (I didn't even know that was a thing).
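For completeness, this is the little check I run on the host before each attempt to see which kernel driver currently owns the card — for passthrough it should be vfio-pci, not mlx4_core. The PCI address below is a placeholder; substitute your card's slot from the IOMMU listing at the bottom of the post:

```shell
# Report which kernel driver is bound to a PCI device via sysfs.
pci_driver() {
  local dev="$1" drv="/sys/bus/pci/devices/$1/driver"
  if [ -e "$drv" ]; then
    echo "$dev -> $(basename "$(readlink -f "$drv")")"
  else
    echo "$dev: no driver bound (or device not present)"
  fi
}

pci_driver "0000:08:00.0"   # placeholder address -- use your NIC's slot
```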
This email on kernel.org's mailing list suggested this could be due to Message Signaled Interrupts (MSI), but they're enabled in my guests (AFAICT), and I believe they are in pve-kernel too, or it wouldn't detect the card... right?
This does seem to be some sort of interrupt problem, though (per Windows' error message), one that I've seen reported on these forums by another person (here), but with no solution.
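To check the MSI side on the host, lspci can show whether the card exposes MSI/MSI-X and whether the capability is enabled ("Enable+" in the output). Again, the address is a placeholder for the NIC's slot:

```shell
# Dump the MSI/MSI-X capability lines for a PCI device; guarded so it degrades
# gracefully on systems without pciutils or without that device.
msi_caps() {
  local dev="$1"
  if ! command -v lspci >/dev/null 2>&1; then
    echo "lspci not available (install pciutils)"
    return 0
  fi
  lspci -vv -s "$dev" 2>/dev/null | grep -i 'msi' \
    || echo "no MSI capability lines found for $dev"
}

msi_caps "0000:08:00.0"   # placeholder address -- use your NIC's slot
```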
Is there anyone with a similar setup who has had more luck passing these cards through and can offer a solution? I'm happy to try things I haven't tried yet (but what I'm really hoping for is someone who says "oh yes, I had this issue, it's this stupid flag here that you need to change").
Thanks!
IOMMU groups:
Bash:
# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/3/devices/0000:00:14.3
/sys/kernel/iommu_groups/3/devices/0000:00:14.0
/sys/kernel/iommu_groups/1/devices/0000:03:00.0
/sys/kernel/iommu_groups/1/devices/0000:09:00.0
/sys/kernel/iommu_groups/1/devices/0000:02:00.2
/sys/kernel/iommu_groups/1/devices/0000:02:00.0
/sys/kernel/iommu_groups/1/devices/0000:03:06.0
/sys/kernel/iommu_groups/1/devices/0000:08:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.1
/sys/kernel/iommu_groups/1/devices/0000:0a:00.0
/sys/kernel/iommu_groups/1/devices/0000:03:05.0
/sys/kernel/iommu_groups/1/devices/0000:02:00.1
/sys/kernel/iommu_groups/1/devices/0000:03:01.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.2
/sys/kernel/iommu_groups/1/devices/0000:03:04.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/1/devices/0000:03:07.0
/sys/kernel/iommu_groups/4/devices/0000:00:18.3
/sys/kernel/iommu_groups/4/devices/0000:00:18.1
/sys/kernel/iommu_groups/4/devices/0000:00:18.6
/sys/kernel/iommu_groups/4/devices/0000:00:18.4
/sys/kernel/iommu_groups/4/devices/0000:00:18.2
/sys/kernel/iommu_groups/4/devices/0000:00:18.0
/sys/kernel/iommu_groups/4/devices/0000:00:18.7
/sys/kernel/iommu_groups/4/devices/0000:00:18.5
/sys/kernel/iommu_groups/2/devices/0000:0c:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:08.0
/sys/kernel/iommu_groups/2/devices/0000:0b:00.2
/sys/kernel/iommu_groups/2/devices/0000:0b:00.0
/sys/kernel/iommu_groups/2/devices/0000:0c:00.1
/sys/kernel/iommu_groups/2/devices/0000:00:08.1
/sys/kernel/iommu_groups/2/devices/0000:0b:00.3
/sys/kernel/iommu_groups/2/devices/0000:0b:00.1
/sys/kernel/iommu_groups/2/devices/0000:0b:00.6
/sys/kernel/iommu_groups/2/devices/0000:00:08.2
/sys/kernel/iommu_groups/2/devices/0000:0b:00.4
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/0/devices/0000:01:00.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.1
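In case it helps, this is the small helper I use to print the same groups with each device's lspci description attached (my own sketch, not from any official tooling; it takes the sysfs root as an argument, defaulting to the real one):

```shell
# List every IOMMU group with the PCI description of each member device.
list_iommu_groups() {
  local base="${1:-/sys/kernel/iommu_groups}" link group dev desc
  [ -d "$base" ] || { echo "no IOMMU groups at $base"; return 0; }
  for link in "$base"/*/devices/*; do
    [ -e "$link" ] || continue
    group="$(basename "$(dirname "$(dirname "$link")")")"
    dev="$(basename "$link")"
    desc=""
    if command -v lspci >/dev/null 2>&1; then
      desc="$(lspci -s "$dev" 2>/dev/null)"
    fi
    echo "group $group: ${desc:-$dev}"
  done
}

list_iommu_groups
```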