[SOLVED] mcx4121a NIC drops after booting VM with sas2308 passthrough under PVE 8.1.4

appid

Environment Details:
- PVE Version: 8.1.4
- Kernel Version: 6.5.13-1-pve
- Network Card Model: MCX4121A-ACAT
- sas2308 Model: LSI SAS 9217-8i

SysLog:
Code:
Mar 11 23:07:25 pve pvedaemon[1449]: <root@pam> end task UPID:pve:00001FE2:00033FFC:65EF1E2D:qmclone:1000:root@pam: OK
Mar 11 23:07:33 pve pvedaemon[1450]: <root@pam> successful auth for user 'root@pam'
Mar 11 23:07:55 pve pvedaemon[1449]: <root@pam> update VM 104: -hostpci0 mapping=sas2308,pcie=1
Mar 11 23:08:00 pve pvedaemon[8277]: start VM 104: UPID:pve:00002055:00034DBC:65EF1E50:qmstart:104:root@pam:
Mar 11 23:08:00 pve pvedaemon[1450]: <root@pam> starting task UPID:pve:00002055:00034DBC:65EF1E50:qmstart:104:root@pam:
Mar 11 23:08:00 pve kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Mar 11 23:08:00 pve kernel: sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Mar 11 23:08:00 pve kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221107000000)
Mar 11 23:08:00 pve kernel: mpt2sas_cm0: removing handle(0x0009), sas_addr(0x4433221107000000)
Mar 11 23:08:00 pve kernel: mpt2sas_cm0: enclosure logical id(0x500605b009acb4c0), slot(4)
Mar 11 23:08:00 pve kernel: mpt2sas_cm0: sending message unit reset !!
Mar 11 23:08:00 pve kernel: mpt2sas_cm0: message unit reset: SUCCESS
Mar 11 23:08:00 pve kernel: mlx5_core 0000:01:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
Mar 11 23:08:00 pve kernel: mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
Mar 11 23:08:05 pve kernel: vmbr0: port 1(enp1s0f0np0) entered disabled state
Mar 11 23:08:05 pve kernel: mlx5_core 0000:01:00.0 enp1s0f0np0 (unregistering): left allmulticast mode
Mar 11 23:08:05 pve kernel: mlx5_core 0000:01:00.0 enp1s0f0np0 (unregistering): left promiscuous mode
Mar 11 23:08:05 pve kernel: vmbr0: port 1(enp1s0f0np0) entered disabled state
Mar 11 23:08:05 pve kernel: mlx5_core 0000:01:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
Mar 11 23:08:06 pve kernel: mlx5_core 0000:01:00.0: E-Switch: cleanup
Mar 11 23:08:07 pve kernel: mlx5_core 0000:01:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
Mar 11 23:08:07 pve kernel: mlx5_core 0000:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
Mar 11 23:08:11 pve kernel: mlx5_core 0000:01:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), necvfs(0), active vports(0)
Mar 11 23:08:12 pve kernel: mlx5_core 0000:01:00.1: E-Switch: cleanup
Mar 11 23:08:13 pve systemd[1]: Started 104.scope.

Replication process:
1. Created vmbr0 bridged to enp1s0f0np0 (port 1 of the mcx4121a network card); see the bridge configuration sketched below the list.
2. Created a Resource Mapping named "sas2308".
3. Directly assigned the PCIe device sas2308 to VM 104 (via the mapping) and started the VM.
4. Checked the syslog: after VM 104 started, mlx5_core began reporting errors and the network card went offline.
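
For reference, the bridge from step 1 looks roughly like this in /etc/network/interfaces (a sketch only; the address and gateway are placeholders, not values from this host):
Code:
auto enp1s0f0np0
iface enp1s0f0np0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports enp1s0f0np0
        bridge-stp off
        bridge-fd 0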

Other Notes:
1. Tested direct assignment of the onboard Intel I210 network card; this did not cause mlx5_core errors. No additional PCIe devices are available for further testing.
2. Starting one or more virtual machines that use the vmbr0-bridged network card, without the sas2308 directly assigned, does not cause mlx5_core errors (a way to check the device bindings is sketched below).
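
A rough way to see what happens to the devices when VM 104 starts is to compare the kernel driver bound to each card before and after starting it (01:00.0 is the mlx5 port from the log above; 02:00.0 is assumed here to be the sas2308 address, substitute yours if it differs):
Code:
# prints "Kernel driver in use:" for each device
lspci -nnk -s 01:00.0
lspci -nnk -s 02:00.0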
 
Check your IOMMU groups. Devices in the same group cannot be shared between VMs and/or the Proxmox host. This comes up regularly on the forum.
 
How do I check the IOMMU groups?
I checked the PCI IDs of the devices:
sas2308: 0000:02:00.0
mcx4121a: 0000:01:00.0
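
(For reference, both cards and their addresses can be listed by filtering lspci on vendor ID; a sketch, where 15b3 is Mellanox and 1000 is LSI/Broadcom:)
Code:
lspci -nn -d 15b3:   # Mellanox ConnectX-4 Lx ports
lspci -nn -d 1000:   # LSI SAS2308 HBA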
 
Look in the IOMMU column in the Proxmox web GUI when selecting the (raw) device to pass through.
Or run this command: pvesh get /nodes/NODENAME/hardware/pci --pci-class-blacklist "" where NODENAME is your Proxmox server/node name, and look at the iommugroup column.
Or run for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done and look at which group each device lands in.
Or follow the Proxmox Wiki page: https://pve.proxmox.com/wiki/PCI_Passthrough#Verify_IOMMU_isolation
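
If both of the addresses posted above resolve to the same group number, that would explain the NIC dropping: passing the sas2308 through takes the whole group away from the host, mlx5 ports included. A minimal sysfs check (assuming those addresses):
Code:
# the last path component of each symlink is the IOMMU group number
readlink /sys/bus/pci/devices/0000:01:00.0/iommu_group
readlink /sys/bus/pci/devices/0000:02:00.0/iommu_group
# list everything sitting in a given group, e.g. group 1 (hypothetical number)
ls /sys/kernel/iommu_groups/1/devices/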
 
Thanks, based on your info I found the cause of the problem.