I have a Mellanox ConnectX LX configured with 8 VFs, some of which are passed through to my VMs. This had worked without issues until today.
I added a Radeon RX 6600 XT GPU, with the intention of passing it through to a Windows 11 VM. The first issue I noticed is that the Mellanox devices were reassigned to a new ID, 05:00.0, from the previous 02:00.0. The VMs with the passed-through VFs didn't work and I got a flood of
x86/PAT: kvm:21286 conflicting memory types 401c900000-401ca00000 uncached-minus<->write-combining
messages in dmesg, with some other issues on top:
Code:
[ 756.298607] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 756.320050] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 756.320314] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[ 756.320451] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b2f00 (59.47.0)
[ 756.320725] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 756.329366] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[ 756.330774] [drm] DMUB hardware initialized: version=0x02020020
[ 756.352468] [drm] kiq ring mec 2 pipe 1 q 0
[ 756.356247] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 756.356554] [drm] JPEG decode initialized successfully.
[ 756.356697] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 756.356827] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 756.356954] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 756.357078] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 756.357201] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 756.357322] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 756.357438] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 756.357553] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 756.357667] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 756.357776] amdgpu 0000:03:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 756.357882] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 756.357988] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 756.358089] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
[ 756.358191] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
[ 756.358293] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
[ 756.358392] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 756.361537] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
[ 756.376259] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[ 756.495649] [drm] amdgpu: ttm finalized
[ 776.684228] vfio-pci 0000:05:01.3: enabling device (0000 -> 0002)
[ 776.800538] x86/PAT: kvm:21286 conflicting memory types 401c900000-401ca00000 uncached-minus<->write-combining
[ 776.800734] x86/PAT: memtype_reserve failed [mem 0x401c900000-0x401c9fffff], track uncached-minus, req uncached-minus
[ 776.800928] ioremap memtype_reserve failed -16
[ 777.066986] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[ 777.307035] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[ 777.390947] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 777.631008] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[ 778.307553] irq 16: nobody cared (try booting with the "irqpoll" option)
[ 778.307671] CPU: 11 PID: 0 Comm: swapper/11 Tainted: P O 6.5.11-8-pve #1
[ 778.307943] Hardware name: HP HP Z2 Tower G9 Workstation Desktop PC/895C, BIOS U50 Ver. 02.04.02 11/06/2023
[ 778.308234] Call Trace:
[ 778.308383] <IRQ>
[ 778.308533] dump_stack_lvl+0x48/0x70
[ 778.308685] dump_stack+0x10/0x20
[ 778.308832] __report_bad_irq+0x30/0xd0
[ 778.308979] note_interrupt+0x2e1/0x320
[ 778.309126] handle_irq_event+0x79/0x80
[ 778.309271] handle_fasteoi_irq+0x7d/0x200
[ 778.309418] __common_interrupt+0x43/0xd0
[ 778.309565] common_interrupt+0x9f/0xb0
[ 778.309707] </IRQ>
[ 778.309846] <TASK>
[ 778.309983] asm_common_interrupt+0x27/0x40
[ 778.310119] RIP: 0010:cpuidle_enter_state+0xce/0x470
[ 778.310255] Code: 28 10 ff e8 64 f6 ff ff 8b 53 04 49 89 c6 0f 1f 44 00 00 31 ff e8 22 25 0f ff 80 7d d7 00 0f 85 e7 01 00 00 fb 0f 1f 44 00 00 <45> 85 ff 0f 88 83 01 00 00 49 63 d7 4c 89 f1 48 8d 04 52 48 8d 04
[ 778.310660] RSP: 0018:ffffc18a8018fe50 EFLAGS: 00000246
[ 778.310801] RAX: 0000000000000000 RBX: ffffe18a7fcc0930 RCX: 0000000000000000
[ 778.310947] RDX: 000000000000000b RSI: 0000000000000000 RDI: 0000000000000000
[ 778.311088] RBP: ffffc18a8018fe88 R08: 0000000000000000 R09: 0000000000000000
[ 778.311225] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
[ 778.311379] R13: ffffffff99e690e0 R14: 000000b536bec4f4 R15: 0000000000000004
[ 778.311528] cpuidle_enter+0x2e/0x50
[ 778.311658] call_cpuidle+0x23/0x60
[ 778.311784] do_idle+0x202/0x260
[ 778.311905] cpu_startup_entry+0x2a/0x30
[ 778.312021] start_secondary+0x119/0x140
[ 778.312133] secondary_startup_64_no_verify+0x17e/0x18b
[ 778.312243] </TASK>
[ 778.312349] handlers:
[ 778.312458] [<00000000a84d1531>] vfio_intx_handler [vfio_pci_core]
[ 778.312580] Disabling IRQ #16
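As a side note on the "irq 16: nobody cared" splat: it suggests something kept asserting that interrupt line after VFIO's INTx handler masked it. A quick (hypothetical) way to see which host devices sit on IRQ 16 would be:

```shell
# Diagnostic sketch: list what shares IRQ 16 on the host.
# Output is hardware-dependent; run on the affected machine.
grep -E '^ *16:' /proc/interrupts
for d in /sys/bus/pci/devices/*; do
  [ "$(cat "$d/irq" 2>/dev/null)" = "16" ] && echo "$d"
done
```

If several devices share that line, that could explain the spurious-interrupt storm independently of the PAT issue.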
Now, I checked /proc/iomem and didn't see any conflicts:
Code:
00000000-00000fff : Reserved
00001000-0009efff : System RAM
0009f000-000fffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-4783a017 : System RAM
4783a018-4784e257 : System RAM
4784e258-56fc7fff : System RAM
56fc8000-5703dfff : Reserved
5703e000-621dafff : System RAM
621db000-65d5dfff : Reserved
65d5e000-65f5dfff : ACPI Non-volatile Storage
65f5e000-65ffefff : ACPI Tables
65fff000-65ffffff : System RAM
66000000-69ffffff : Reserved
6b200000-6b3fffff : Reserved
6bc00000-807fffff : Reserved
80800000-bfffffff : PCI Bus 0000:00
80a00000-80cfffff : PCI Bus 0000:01
80a00000-80bfffff : PCI Bus 0000:02
80a00000-80bfffff : PCI Bus 0000:03
80a00000-80afffff : 0000:03:00.0
80b00000-80b03fff : 0000:03:00.1
80b00000-80b03fff : ICH HD audio
80b20000-80b3ffff : 0000:03:00.0
80c00000-80c03fff : 0000:01:00.0
80d00000-80d00fff : 0000:00:1f.5
80e00000-80e1ffff : 0000:00:1f.6
80e00000-80e1ffff : vfio-pci
80e20000-80e21fff : 0000:00:17.0
80e20000-80e21fff : vfio-pci
80e23000-80e237ff : 0000:00:17.0
80e23000-80e237ff : vfio-pci
80e23800-80e23fff : vfio sub-page reserved
80e24000-80e240ff : 0000:00:17.0
80e24000-80e240ff : vfio-pci
80e24100-80e24fff : vfio sub-page reserved
80f00000-810fffff : PCI Bus 0000:05
80f00000-80ffffff : 0000:05:00.0
81000000-810fffff : 0000:05:00.1
81100000-817fffff : PCI Bus 0000:04
81100000-813fffff : 0000:04:00.0
81400000-816fffff : 0000:04:00.0
81700000-81703fff : 0000:04:00.0
81700000-81703fff : nvme
81704000-8170ffff : 0000:04:00.0
81800000-81efffff : PCI Bus 0000:06
81800000-81afffff : 0000:06:00.0
81b00000-81dfffff : 0000:06:00.0
81e00000-81e03fff : 0000:06:00.0
81e00000-81e03fff : nvme
81e04000-81e0ffff : 0000:06:00.0
c0000000-cfffffff : PCI MMCONFIG 0000 [bus 00-ff]
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
fed00000-fed003ff : PNP0103:00
fed20000-fed7ffff : Reserved
fed40000-fed44fff : IFX1521:00
fed40000-fed44fff : IFX1521:00 IFX1521:00
fed90000-fed90fff : dmar0
fed91000-fed91fff : dmar1
feda0000-feda0fff : pnp 00:05
feda1000-feda1fff : pnp 00:05
fedb0000-fedbffff : pnp 00:04
fedc0000-fedc7fff : pnp 00:05
fee00000-feefffff : pnp 00:05
fee00000-fee00fff : Local APIC
100000000-87f7fffff : System RAM
3fe600000-3ff9fffff : Kernel code
3ffa00000-400674fff : Kernel rodata
400800000-400b7febf : Kernel data
40102f000-4021fffff : Kernel bss
87f800000-87fffffff : RAM buffer
4000000000-7fffffffff : PCI Bus 0000:00
4000000000-400fffffff : 0000:00:02.0
4010000000-4016ffffff : 0000:00:02.0
4018000000-401dffffff : PCI Bus 0000:05
4018000000-4019ffffff : 0000:05:00.0
4018000000-4019ffffff : mlx5_core
401a000000-401bffffff : 0000:05:00.1
401a000000-401bffffff : mlx5_core
401c000000-401c7fffff : 0000:05:00.0
401c800000-401cffffff : 0000:05:00.1
401c800000-401c8fffff : 0000:05:01.2
401c900000-401c9fffff : 0000:05:01.3
401ca00000-401cafffff : 0000:05:01.4
401ca00000-401cafffff : mlx5_core
401cb00000-401cbfffff : 0000:05:01.5
401cb00000-401cbfffff : mlx5_core
401cc00000-401ccfffff : 0000:05:01.6
401cc00000-401ccfffff : mlx5_core
401cd00000-401cdfffff : 0000:05:01.7
401cd00000-401cdfffff : mlx5_core
401ce00000-401cefffff : 0000:05:02.0
401ce00000-401cefffff : mlx5_core
401cf00000-401cffffff : 0000:05:02.1
401cf00000-401cffffff : mlx5_core
4020000000-40ffffffff : 0000:00:02.0
6000000000-620fffffff : PCI Bus 0000:01
6000000000-620fffffff : PCI Bus 0000:02
6000000000-620fffffff : PCI Bus 0000:03
6000000000-61ffffffff : 0000:03:00.0
6200000000-620fffffff : 0000:03:00.0
6214000000-6214ffffff : 0000:00:02.0
6215000000-621500ffff : 0000:00:14.0
6215000000-621500ffff : xhci-hcd
6215010000-6215013fff : 0000:00:14.2
6215014000-62150140ff : 0000:00:1f.4
6215016000-6215016fff : 0000:00:14.2
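One thing I did notice: the range in the PAT message lines up exactly with the 0000:05:01.3 entry above (the VF I'm passing through). The PAT message prints an exclusive end address while /proc/iomem prints an inclusive one, which is easy to confirm with plain shell arithmetic:

```shell
# The x86/PAT message reports 401c900000-401ca00000 (end exclusive);
# /proc/iomem lists 401c900000-401c9fffff for 0000:05:01.3 (end inclusive).
pat_start=$((0x401c900000)); pat_end=$((0x401ca00000))
bar_start=$((0x401c900000)); bar_end=$((0x401c9fffff))
if [ "$pat_start" -eq "$bar_start" ] && [ "$pat_end" -eq $((bar_end + 1)) ]; then
  echo "PAT conflict range == 0000:05:01.3 VF BAR"
fi
```

So the "conflict" isn't between two iomem regions; it's two mappings of the *same* VF BAR requesting different memory types.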
Additionally, I am getting Error 43 in the Windows 11 VM where the GPU is passed through. There are no other errors whatsoever in dmesg when that VM is initialized.
I can get the dmesg flood to stop by blacklisting the amdgpu driver, even though it is supposedly fine to leave it enabled for passthrough, at least according to some posts I saw here on the forums (e.g. https://forum.proxmox.com/threads/problem-with-gpu-passthrough.55918/post-483749).
Now, I found in some other sources (e.g. https://forum.level1techs.com/t/am5...x-cpu-2-kvm-conflicting-memory-types/201555/4) that it is recommended to disable Mellanox's VF autoprobing in case of memory type conflicts, which I did, but the result is that the VMs no longer start at all, unable to bind to the PCI node. When issuing the kvm command directly:
/usr/bin/kvm -id 127 -name 'TrueNAS,debug-threads=on' -no-shutdown -chardev 'socket,id=qmp,path=/var/run/qemu-server/127.qmp,server=on,wait=off' -mon 'chardev=qmp,mode=control' -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' -mon 'chardev=qmp-event,mode=control' -pidfile /var/run/qemu-server/127.pid -daemonize -smbios 'type=1,uuid=744c0128-708d-4c5c-b6a1-93afe96aaf05' -smp '16,sockets=1,cores=16,maxcpus=16' -nodefaults -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' -vnc 'unix:/var/run/qemu-server/127.vnc,password=on' -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt -m 6144 -object 'iothread,id=iothread-virtio0' -readconfig /usr/share/qemu-server/pve-q35-4.0.cfg -device 'vmgenid,guid=079b0904-e321-49aa-9c55-c338a4881fb0' -device 'usb-tablet,id=tablet,bus=ehci.0,port=1' -device 'vfio-pci,host=0000:00:17.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'vfio-pci,host=0000:05:01.3,id=hostpci1,bus=pci.0,addr=0x11' -device 'usb-host,vendorid=0x152d,productid=0x1561,id=usb0' -device 'usb-host,vendorid=0x7825,productid=0xa2a4,id=usb1' -device 'VGA,id=vga,bus=pcie.0,addr=0x1' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3,free-page-reporting=on' -iscsi 'initiator-name=iqn.1993-08.org.debian:01:5c218c373c6d' -drive 'file=/var/lib/vz/images/127/vm-127-disk-0.qcow2,if=none,id=drive-virtio0,discard=on,format=qcow2,cache=none,aio=io_uring,detect-zeroes=unmap' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,iothread=iothread-virtio0,bootindex=100' -machine 'type=q35+pve0'
I am seeing:
kvm: -device vfio-pci,host=0000:05:01.3,id=hostpci1,bus=pci.0,addr=0x11: vfio 0000:05:01.3: failed to open /dev/vfio/18: No such file or directory
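To my understanding, the /dev/vfio/18 char device only appears once a device in that IOMMU group is actually bound to vfio-pci, so with autoprobing off and nothing bound, the node is simply never created. A quick diagnostic sketch (assuming the VF address from the command above):

```shell
# Diagnostic sketch: find the VF's IOMMU group number and current driver.
VF=0000:05:01.3
basename "$(readlink /sys/bus/pci/devices/$VF/iommu_group)"   # group number, matching /dev/vfio/<n>
readlink /sys/bus/pci/devices/$VF/driver || echo "no driver bound"
ls /dev/vfio/                                                 # group node absent until something binds
```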
Nowhere did I see anyone mention a requirement to additionally configure vfio-pci before passing a VF to a VM when autoprobing is disabled; it's supposed to just work. Regardless, after comparing the dmesg output between autoprobing enabled and disabled, I can see that in the former case, enabling those VFs for vfio-pci is exactly what triggers the conflicting memory type issues:
[ 20.199492] vfio-pci 0000:05:01.2: enabling device (0000 -> 0002)
[ 20.315487] x86/PAT: kvm:1981 conflicting memory types 401c800000-401c900000 uncached-minus<->write-combining
[ 20.315504] x86/PAT: memtype_reserve failed [mem 0x401c8
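For the record, when autoprobing is disabled, the VF apparently has to be bound to vfio-pci by hand before the /dev/vfio group node appears. A minimal sketch of that step (using the VF address from my setup; I have not confirmed this avoids the PAT conflict):

```shell
# Sketch: manually bind one VF to vfio-pci when sriov_drivers_autoprobe=0.
# Run as root on the host.
VF=0000:05:01.3
echo vfio-pci > /sys/bus/pci/devices/$VF/driver_override
echo "$VF" > /sys/bus/pci/drivers_probe
ls -l /dev/vfio/   # the group char device should now exist
```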
This makes me conclude that disabling autoprobing wouldn't really help here, since it would fail just the same as soon as I initialized those VFs for VFIO by hand (via its module parameter). Let me remind you that this all works just fine without the amdgpu module, and the latter seems to be the actual source of the conflict (although I still don't understand why disabling autoprobing stops the VMs from using the VFs).
I would appreciate any help here; I am pulling my hair out.
Some extra info:
SysFS:
class/net/ens4f1np1/device/sriov_drivers_autoprobe = 0
class/net/ens4f1np1/device/sriov_numvfs = 8
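For reference, the sysfs state above was produced along these lines (order matters, as far as I know: autoprobe has to be turned off before the VFs are created, otherwise mlx5_core grabs them anyway):

```shell
# Sketch of how the sysfs values above were set (run as root; not persistent across reboots)
echo 0 > /sys/class/net/ens4f1np1/device/sriov_drivers_autoprobe
echo 8 > /sys/class/net/ens4f1np1/device/sriov_numvfs
```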
/etc/kernel/cmdline:
root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt pci=assign-busses vfio-pci.ids=1002:1478,1002:1479 initcall_blacklist=sysfb_init disable_vga=1 module_blacklist=amdgpu
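For completeness: since this is a ZFS-booted Proxmox host using /etc/kernel/cmdline, changes there only take effect after refreshing the boot entries and rebooting, i.e.:

```shell
# Regenerate boot entries after editing /etc/kernel/cmdline, then reboot
proxmox-boot-tool refresh
```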