Proxmox 9.0 Beta - kernel issues with vfio-pci on Mellanox 100G.

dominiaz

Renowned Member
Sep 16, 2016
The kernel is broken with Mellanox ConnectX-5 100G VFs on Proxmox 9.0 Beta. The card only works on the host without VF passthrough, so I suspect vfio-pci is broken in this release.

kvm: -device vfio-pci,host=0000:81:00.1,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio 0000:81:00.1: error getting device from group 89: Permission denied
Verify all devices in group 89 are bound to vfio-<bus> or pci-stub and not already in use
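The error asks to verify what else is in the group and what the VF is bound to; a minimal way to check that on the host (assuming the standard sysfs layout and that the VF still lands in group 89) would be:

Code:
# list all devices sharing IOMMU group 89 with the VF
ls /sys/kernel/iommu_groups/89/devices/
# show the kernel driver currently bound to the VF ("Kernel driver in use:")
lspci -nnk -s 0000:81:00.1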

Code:
agent: 1
balloon: 0
boot: order=virtio0;ide2;net0
cores: 20
cpu: host
hostpci0: 0000:81:00.1,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 32768
meta: creation-qemu=9.2.0,ctime=1752404131
name: debian13rc2
net0: virtio=BC:24:11:80:0C:69,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=cbfa1f66-f75a-48f8-83ed-f014ba8b1089
sockets: 1
virtio0: local-zfs:vm-20100-disk-0,cache=directsync,discard=on,iothread=1,size=200G
virtio1: xiraid2:20100/vm-20100-disk-0.raw,aio=native,cache=directsync,iothread=1,size=32G
vmgenid: e0daa015-4aa1-49e4-81ac-f4217fb4d28e
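(For reference only, not part of the original report: the hostpci0 line above is what the GUI writes; assuming the VMID is 20100, as the disk names suggest, the same passthrough entry could be set from the shell with:)

Code:
qm set 20100 -hostpci0 0000:81:00.1,pcie=1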

Mellanox ConnectX-5 100G
Code:
echo 8 | sudo tee /sys/class/net/ens3np0/device/sriov_numvfs
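As an aside (not from the original post): the sriov_numvfs write above does not persist across reboots. A minimal sketch for making it persistent, assuming the interface keeps the name ens3np0, is a oneshot systemd unit along these lines:

Code:
# /etc/systemd/system/sriov-ens3np0.service  (hypothetical unit name)
[Unit]
Description=Create 8 SR-IOV VFs on ens3np0
After=network-pre.target
Before=network.target

[Service]
Type=oneshot
# same write as above, just done automatically at boot
ExecStart=/bin/sh -c 'echo 8 > /sys/class/net/ens3np0/device/sriov_numvfs'

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable sriov-ens3np0.service.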

Code:
lspci | grep Mellanox
81:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
81:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
81:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
81:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
81:00.4 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
81:00.5 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
81:00.6 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
81:00.7 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
81:01.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

Code:
# journalctl -b 0 | grep -i iommu
Jul 20 22:51:12 s2 kernel: iommu: Default domain type: Translated
Jul 20 22:51:12 s2 kernel: iommu: DMA domain TLB invalidation policy: lazy mode
Jul 20 22:51:12 s2 kernel: pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
Jul 20 22:51:12 s2 kernel: pci 0000:c0:01.0: Adding to iommu group 0
Jul 20 22:51:12 s2 kernel: pci 0000:c0:01.1: Adding to iommu group 1
Jul 20 22:51:12 s2 kernel: pci 0000:c0:01.2: Adding to iommu group 2
Jul 20 22:51:12 s2 kernel: pci 0000:c0:01.3: Adding to iommu group 3
Jul 20 22:51:12 s2 kernel: pci 0000:c0:01.4: Adding to iommu group 4
Jul 20 22:51:12 s2 kernel: pci 0000:c0:02.0: Adding to iommu group 5
Jul 20 22:51:12 s2 kernel: pci 0000:c0:03.0: Adding to iommu group 6
Jul 20 22:51:12 s2 kernel: pci 0000:c0:04.0: Adding to iommu group 7
Jul 20 22:51:12 s2 kernel: pci 0000:c0:05.0: Adding to iommu group 8
Jul 20 22:51:12 s2 kernel: pci 0000:c0:05.2: Adding to iommu group 8
Jul 20 22:51:12 s2 kernel: pci 0000:c0:07.0: Adding to iommu group 9
Jul 20 22:51:12 s2 kernel: pci 0000:c0:07.1: Adding to iommu group 10
Jul 20 22:51:12 s2 kernel: pci 0000:c0:08.0: Adding to iommu group 11
Jul 20 22:51:12 s2 kernel: pci 0000:c0:08.1: Adding to iommu group 12
Jul 20 22:51:12 s2 kernel: pci 0000:c1:00.0: Adding to iommu group 13
Jul 20 22:51:12 s2 kernel: pci 0000:c2:00.0: Adding to iommu group 14
Jul 20 22:51:12 s2 kernel: pci 0000:c3:00.0: Adding to iommu group 15
Jul 20 22:51:12 s2 kernel: pci 0000:c4:00.0: Adding to iommu group 16
Jul 20 22:51:12 s2 kernel: pci 0000:c5:00.0: Adding to iommu group 8
Jul 20 22:51:12 s2 kernel: pci 0000:c6:00.0: Adding to iommu group 8
Jul 20 22:51:12 s2 kernel: pci 0000:c7:00.0: Adding to iommu group 17
Jul 20 22:51:12 s2 kernel: pci 0000:c7:00.2: Adding to iommu group 18
Jul 20 22:51:12 s2 kernel: pci 0000:c8:00.0: Adding to iommu group 19
Jul 20 22:51:12 s2 kernel: pci 0000:c8:00.2: Adding to iommu group 20
Jul 20 22:51:12 s2 kernel: pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
Jul 20 22:51:12 s2 kernel: pci 0000:80:01.0: Adding to iommu group 21
Jul 20 22:51:12 s2 kernel: pci 0000:80:01.1: Adding to iommu group 22
Jul 20 22:51:12 s2 kernel: pci 0000:80:02.0: Adding to iommu group 23
Jul 20 22:51:12 s2 kernel: pci 0000:80:03.0: Adding to iommu group 24
Jul 20 22:51:12 s2 kernel: pci 0000:80:03.1: Adding to iommu group 24
Jul 20 22:51:12 s2 kernel: pci 0000:80:03.2: Adding to iommu group 24
Jul 20 22:51:12 s2 kernel: pci 0000:80:03.3: Adding to iommu group 25
Jul 20 22:51:12 s2 kernel: pci 0000:80:03.4: Adding to iommu group 26
Jul 20 22:51:12 s2 kernel: pci 0000:80:04.0: Adding to iommu group 27
Jul 20 22:51:12 s2 kernel: pci 0000:80:05.0: Adding to iommu group 28
Jul 20 22:51:12 s2 kernel: pci 0000:80:07.0: Adding to iommu group 29
Jul 20 22:51:12 s2 kernel: pci 0000:80:07.1: Adding to iommu group 30
Jul 20 22:51:12 s2 kernel: pci 0000:80:08.0: Adding to iommu group 31
Jul 20 22:51:12 s2 kernel: pci 0000:80:08.1: Adding to iommu group 32
Jul 20 22:51:12 s2 kernel: pci 0000:81:00.0: Adding to iommu group 33
Jul 20 22:51:12 s2 kernel: pci 0000:84:00.0: Adding to iommu group 34
Jul 20 22:51:12 s2 kernel: pci 0000:85:00.0: Adding to iommu group 35
Jul 20 22:51:12 s2 kernel: pci 0000:86:00.0: Adding to iommu group 36
Jul 20 22:51:12 s2 kernel: pci 0000:86:00.2: Adding to iommu group 37
Jul 20 22:51:12 s2 kernel: pci 0000:87:00.0: Adding to iommu group 38
Jul 20 22:51:12 s2 kernel: pci 0000:87:00.2: Adding to iommu group 39
Jul 20 22:51:12 s2 kernel: pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
Jul 20 22:51:12 s2 kernel: pci 0000:40:01.0: Adding to iommu group 40
Jul 20 22:51:12 s2 kernel: pci 0000:40:01.3: Adding to iommu group 41
Jul 20 22:51:12 s2 kernel: pci 0000:40:01.4: Adding to iommu group 40
Jul 20 22:51:12 s2 kernel: pci 0000:40:02.0: Adding to iommu group 42
Jul 20 22:51:12 s2 kernel: pci 0000:40:03.0: Adding to iommu group 43
Jul 20 22:51:12 s2 kernel: pci 0000:40:03.1: Adding to iommu group 44
Jul 20 22:51:12 s2 kernel: pci 0000:40:04.0: Adding to iommu group 45
Jul 20 22:51:12 s2 kernel: pci 0000:40:05.0: Adding to iommu group 46
Jul 20 22:51:12 s2 kernel: pci 0000:40:07.0: Adding to iommu group 47
Jul 20 22:51:12 s2 kernel: pci 0000:40:07.1: Adding to iommu group 48
Jul 20 22:51:12 s2 kernel: pci 0000:40:08.0: Adding to iommu group 49
Jul 20 22:51:12 s2 kernel: pci 0000:40:08.1: Adding to iommu group 50
Jul 20 22:51:12 s2 kernel: pci 0000:40:08.2: Adding to iommu group 51
Jul 20 22:51:12 s2 kernel: pci 0000:40:08.3: Adding to iommu group 52
Jul 20 22:51:12 s2 kernel: pci 0000:41:00.0: Adding to iommu group 53
Jul 20 22:51:12 s2 kernel: pci 0000:43:00.0: Adding to iommu group 54
Jul 20 22:51:12 s2 kernel: pci 0000:44:00.0: Adding to iommu group 55
Jul 20 22:51:12 s2 kernel: pci 0000:44:00.2: Adding to iommu group 56
Jul 20 22:51:12 s2 kernel: pci 0000:45:00.0: Adding to iommu group 57
Jul 20 22:51:12 s2 kernel: pci 0000:45:00.1: Adding to iommu group 58
Jul 20 22:51:12 s2 kernel: pci 0000:45:00.2: Adding to iommu group 59
Jul 20 22:51:12 s2 kernel: pci 0000:45:00.3: Adding to iommu group 60
Jul 20 22:51:12 s2 kernel: pci 0000:46:00.0: Adding to iommu group 61
Jul 20 22:51:12 s2 kernel: pci 0000:47:00.0: Adding to iommu group 62
Jul 20 22:51:12 s2 kernel: pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
Jul 20 22:51:12 s2 kernel: pci 0000:00:00.0: Adding to iommu group 63
Jul 20 22:51:12 s2 kernel: pci 0000:00:01.0: Adding to iommu group 64
Jul 20 22:51:12 s2 kernel: pci 0000:00:01.1: Adding to iommu group 65
Jul 20 22:51:12 s2 kernel: pci 0000:00:02.0: Adding to iommu group 66
Jul 20 22:51:12 s2 kernel: pci 0000:00:03.0: Adding to iommu group 67
Jul 20 22:51:12 s2 kernel: pci 0000:00:03.1: Adding to iommu group 67
Jul 20 22:51:12 s2 kernel: pci 0000:00:03.2: Adding to iommu group 67
Jul 20 22:51:12 s2 kernel: pci 0000:00:03.3: Adding to iommu group 68
Jul 20 22:51:12 s2 kernel: pci 0000:00:03.4: Adding to iommu group 69
Jul 20 22:51:12 s2 kernel: pci 0000:00:04.0: Adding to iommu group 70
Jul 20 22:51:12 s2 kernel: pci 0000:00:05.0: Adding to iommu group 71
Jul 20 22:51:12 s2 kernel: pci 0000:00:07.0: Adding to iommu group 72
Jul 20 22:51:12 s2 kernel: pci 0000:00:07.1: Adding to iommu group 73
Jul 20 22:51:12 s2 kernel: pci 0000:00:08.0: Adding to iommu group 74
Jul 20 22:51:12 s2 kernel: pci 0000:00:08.1: Adding to iommu group 75
Jul 20 22:51:12 s2 kernel: pci 0000:00:14.0: Adding to iommu group 76
Jul 20 22:51:12 s2 kernel: pci 0000:00:14.3: Adding to iommu group 76
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.0: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.1: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.2: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.3: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.4: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.5: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.6: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:00:18.7: Adding to iommu group 77
Jul 20 22:51:12 s2 kernel: pci 0000:01:00.0: Adding to iommu group 78
Jul 20 22:51:12 s2 kernel: pci 0000:01:00.1: Adding to iommu group 79
Jul 20 22:51:12 s2 kernel: pci 0000:01:00.2: Adding to iommu group 80
Jul 20 22:51:12 s2 kernel: pci 0000:01:00.3: Adding to iommu group 81
Jul 20 22:51:12 s2 kernel: pci 0000:05:00.0: Adding to iommu group 82
Jul 20 22:51:12 s2 kernel: pci 0000:06:00.0: Adding to iommu group 83
Jul 20 22:51:12 s2 kernel: pci 0000:07:00.0: Adding to iommu group 84
Jul 20 22:51:12 s2 kernel: pci 0000:07:00.2: Adding to iommu group 85
Jul 20 22:51:12 s2 kernel: pci 0000:08:00.0: Adding to iommu group 86
Jul 20 22:51:12 s2 kernel: pci 0000:08:00.2: Adding to iommu group 87
Jul 20 22:51:12 s2 kernel: pci 0000:08:00.3: Adding to iommu group 88
Jul 20 22:51:12 s2 kernel: perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
Jul 20 22:51:12 s2 kernel: perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
Jul 20 22:51:12 s2 kernel: perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
Jul 20 22:51:12 s2 kernel: perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
Jul 20 22:51:17 s2 kernel: pci 0000:81:00.1: Adding to iommu group 89
Jul 20 22:51:17 s2 kernel: pci 0000:81:00.2: Adding to iommu group 90
Jul 20 22:51:17 s2 kernel: pci 0000:81:00.3: Adding to iommu group 91
Jul 20 22:51:18 s2 kernel: pci 0000:81:00.4: Adding to iommu group 92
Jul 20 22:51:18 s2 kernel: pci 0000:81:00.5: Adding to iommu group 93
Jul 20 22:51:18 s2 kernel: pci 0000:81:00.6: Adding to iommu group 94
Jul 20 22:51:19 s2 kernel: pci 0000:81:00.7: Adding to iommu group 95
Jul 20 22:51:19 s2 kernel: pci 0000:81:01.0: Adding to iommu group 96

The stock kernel from Proxmox 9.0 Beta (6.14.8-1-pve) is the only one that is broken, and the error appears only when I try to pass through a VF of the ConnectX-5.
Passing through the whole ConnectX-5 PCIe device works fine.

Everything works fine on Proxmox 9.0 Beta with kernel 6.16.0-6-pve (proxmox-kernel-6.16.0-6-pve_6.16.0-6_amd64.deb from https://github.com/KrzysztofHajdamowicz/pve-kernel/releases).
Everything works fine on Proxmox 8.4 (kernel 6.8.12-12-pve).

Please help @dcsapak
 
I have the same problem as you. Hoping to get an answer.
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module is not loaded
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
BlueField2(rev:1) NA 01:00.0 mlx5_0 net-enp1s0f0np0 3

And things only work normally when I pin the kernel to version 6.14.5-1-bpo12-pve. I don't know what to do now.
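Side note (not from the post above): on Proxmox a specific kernel can be pinned with proxmox-boot-tool; the version string below is just the one mentioned here:

Code:
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.14.5-1-bpo12-pve
reboot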


 
I saw the issues, but the current problem seems unsolvable for now: either wait for the kernel to be updated or downgrade the current kernel.
I don't know if the next update will fix it.
T.T
 
Oh, by the way, this problem only affects KVM. LXC containers using physical passthrough are not affected and work normally.
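For context (my own assumption about the setup, not stated above): handing a VF to an LXC container bypasses vfio entirely, since the netdev is just moved into the container's network namespace. It is usually done with raw lxc keys in the container config, roughly like this:

Code:
# /etc/pve/lxc/101.conf  (hypothetical container ID and VF interface name)
lxc.net.1.type: phys
lxc.net.1.link: enp129s0f1v0
lxc.net.1.name: eth1
lxc.net.1.flags: up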
 
It seems so: downgrade the kernel or wait for an update.
 
I spent half a day trying to solve this problem. It was very strange:
physical passthrough to LXC was normal, but KVM was not. During that time
I changed the hardware configuration, thought it was caused by my BIOS, and moved the server around for half a day to verify it. This is really a sad story.
T.T...
 
OK, not sure if this is related, but I have been having a similar and serious issue since I updated to Proxmox 9.

I have narrowed the issue down to the kernel version.
It is definitely present with kernels 6.14.8-2-pve and 6.16.0-50-pve, which behave exactly the same.

Interestingly, kernel 6.8.12-13-pve DOES NOT have the issue; it behaves very differently and works as expected.

Note that I hit the issue mentioned above with kernel 6.8.12-12-pve, where I could not start the VM at all, with the same error message as stated above (kvm: -device vfio-pci,host=0000:81:00.1,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0: vfio 0000:81:00.1: error getting device from group 89: Permission denied Verify all devices in group 89 are bound to vfio-<bus> or pci-stub and not already in use). Kernel 6.8.12-13-pve solved that issue.

Since migrating to kernel 6.14.8-2-pve, when starting a VM that specifically requests a VF on a Mellanox NIC (ConnectX-4 Lx), the server suddenly loses network connectivity, gets kicked out of the cluster, and all PVE services go offline. Note that I remain connected to the server via SSH.

Aug 09 13:22:09 pvf pvedaemon[11154]: VM 301 started with PID 11314.
Aug 09 13:22:09 pvf pvedaemon[7902]: <root@pam> end task UPID:pvf:00002B92:000037EE:68972F57:qmstart:301:root@pam: OK
Aug 09 13:22:13 pvf kernel: hrtimer: interrupt took 150339541 ns
Aug 09 13:22:16 pvf kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 50.112 msecs
Aug 09 13:22:16 pvf kernel: INFO: NMI handler (ghes_notify_nmi) took too long to run: 50.113 msecs
Aug 09 13:22:16 pvf kernel: perf: interrupt took too long (391495 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
Aug 09 13:22:16 pvf kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 50.114 msecs
Aug 09 13:22:30 pvf pve-firewall[7816]: firewall update time (6.299 seconds)
Aug 09 13:22:33 pvf kernel: INFO: NMI handler (ghes_notify_nmi) took too long to run: 50.115 msecs
Aug 09 13:22:41 pvf pve-firewall[7816]: firewall update time (7.676 seconds)
Aug 09 13:22:41 pvf kernel: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 50.115 msecs
Aug 09 13:22:41 pvf kernel: perf: interrupt took too long (773958 > 489368), lowering kernel.perf_event_max_sample_rate to 1000
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] link: host: 3 link: 0 is down
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] link: host: 1 link: 0 is down
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] link: host: 1 link: 1 is down
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] host: host: 3 has no active links
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] host: host: 1 has no active links
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] host: host: 1 has no active links
Aug 09 13:22:46 pvf corosync[7742]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Aug 09 13:22:49 pvf corosync[7742]: [TOTEM ] Token has not been received in 2737 ms

It then complains for several minutes (around 10 min), but interestingly it eventually stabilizes without any external intervention, rejoins the cluster, and only then does the VM start as expected. It stays stable as long as I don't restart the VM; the problem occurs exactly when the VF is reset to be attached to the VM.

This behavior does not occur at all when launching a VM without any VF attached, or when the kernel is 6.8.12-13-pve. So my only solution for the moment is to pin the kernel to 6.8.12-13-pve, but I am not even sure anyone is looking into this issue.
Note that this server has been running without problems for several years. I don't see anything obvious that could explain this apart from a change in the way the new kernels handle VFs.
 
Have you tried this kernel: https://github.com/KrzysztofHajdamowicz/pve-kernel/releases ?
 

Unfortunately yes. The link points to the latest kernel, 6.16.0-50-pve. Exact same behavior as with 6.14.8-2-pve.

Here is my cmdline :

root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt intremap=no_x2apic_optout pci_pt_e820_access=on quiet net.naming-scheme=v252

On this specific server I have had the parameters intremap=no_x2apic_optout pci_pt_e820_access=on set for many years, after fine-tuning a previous installation.
I believe these parameters are the source of the issue with the newer kernels:
if I remove them from the 6.8.12-13-pve boot, I then get exactly the same behavior as with the newer kernels that have them set.
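Side note for anyone wanting to test the same parameters (not from the original post): with a ZFS root like the one above (root=ZFS=...), the kernel command line is normally kept in /etc/kernel/cmdline when proxmox-boot-tool manages the boot entries; with legacy GRUB it lives in /etc/default/grub instead:

Code:
# systemd-boot / proxmox-boot-tool managed systems
nano /etc/kernel/cmdline          # edit the single line of parameters
proxmox-boot-tool refresh
# GRUB-managed systems: edit GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then
update-grub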


In dmesg I see the following after launching the VM:
[ 143.761016] sd 2:0:0:0: [sda] Synchronizing SCSI cache
[ 143.762216] ata3.00: Entering standby power mode
[ 144.394055] sd 3:0:0:0: [sdb] Synchronizing SCSI cache
[ 144.396183] ata4.00: Entering standby power mode
[ 145.031100] sd 4:0:0:0: [sdc] Synchronizing SCSI cache
[ 145.033532] ata5.00: Entering standby power mode
[ 145.660070] sd 5:0:0:0: [sdd] Synchronizing SCSI cache
[ 145.662182] ata6.00: Entering standby power mode
[ 147.879074] vfio-pci 0000:04:00.7: resetting
[ 147.984409] vfio-pci 0000:04:00.7: reset done
[ 148.015177] sd 12:0:0:0: [sdh] Synchronizing SCSI cache
[ 148.015450] ata13.00: Entering standby power mode
[ 148.599196] sd 13:0:0:0: [sdi] Synchronizing SCSI cache
[ 148.599443] ata14.00: Entering standby power mode
[ 149.191231] sd 14:0:0:0: [sdj] Synchronizing SCSI cache
[ 149.191497] ata15.00: Entering standby power mode
[ 149.773240] sd 15:0:0:0: [sdk] Synchronizing SCSI cache
[ 149.774284] ata16.00: Entering standby power mode
[ 150.965432] pcieport 0000:00:1c.6: Enabling MPC IRBNCE
[ 150.965438] pcieport 0000:00:1c.6: Intel PCH root port ACS workaround enabled
[ 150.977644] vfio-pci 0000:0a:00.0: resetting
[ 151.001916] vfio-pci 0000:0a:00.0: reset done
[ 153.077860] vfio-pci 0000:04:00.7: enabling device (0000 -> 0002)
[ 153.077885] vfio-pci 0000:04:00.7: resetting
[ 153.184710] vfio-pci 0000:04:00.7: reset done
[ 153.206521] pcieport 0000:00:1c.6: Enabling MPC IRBNCE
[ 153.206526] pcieport 0000:00:1c.6: Intel PCH root port ACS workaround enabled
[ 153.218619] vfio-pci 0000:0a:00.0: resetting
[ 153.242996] vfio-pci 0000:0a:00.0: reset done
[ 153.306442] vfio-pci 0000:0a:00.0: resetting
[ 153.416515] vfio-pci 0000:0a:00.0: reset done
[ 153.416700] vfio-pci 0000:04:00.7: resetting
[ 153.520667] vfio-pci 0000:04:00.7: reset done
[ 157.271319] hrtimer: interrupt took 150339541 ns
[ 160.528846] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 50.112 msecs
[ 160.679193] INFO: NMI handler (ghes_notify_nmi) took too long to run: 50.113 msecs
[ 160.829542] perf: interrupt took too long (391495 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
[ 160.979887] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 50.114 msecs
[ 177.518104] INFO: NMI handler (ghes_notify_nmi) took too long to run: 50.115 msecs
[ 185.787203] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 50.115 msecs
[ 185.837317] perf: interrupt took too long (773958 > 489368), lowering kernel.perf_event_max_sample_rate to 1000
[ 205.282243] INFO: NMI handler (ghes_notify_nmi) took too long to run: 50.117 msecs
[ 267.776633] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 50.115 msecs
[ 284.565426] INFO: NMI handler (ghes_notify_nmi) took too long to run: 50.117 msecs
[ 367.456936] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 100.228 msecs
[ 367.607282] perf: interrupt took too long (1507506 > 967447), lowering kernel.perf_event_max_sample_rate to 1000
[ 401.285095] sched: DL replenish lagged too much

[ 475.356228] INFO: NMI handler (ghes_notify_nmi) took too long to run: 100.229 msecs

My VM conf is

agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 4
cpu: host
efidisk0: local-zfs:vm-301-disk-2,efitype=4m,size=1M
hostpci0: 0000:00:11.4,pcie=1
hostpci1: 0000:04:00.7,pcie=1   <-- this is the VF
hostpci2: 0000:0a:00,pcie=1
machine: q35,viommu=virtio
memory: 16384
meta: creation-qemu=8.0.2,ctime=1695036555
name: PHACO
numa: 0
ostype: other
protection: 1
scsi0: local-zfs:vm-301-disk-1,discard=on,iothread=1,size=60G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4ffcda14-0f53-4abc-9430-9612c559e0b6
sockets: 1
tags: truenas
vmgenid: ec353e94-5d82-4d49-9e83-b5f5152b8462

Also, since updating I get the following console message at shutdown and at restart:

watchdog: watchdog0: watchdog did not stop!
watchdog: watchdog0: watchdog did not stop!


I did not have this before the upgrade. Mentioning it just in case it has something to do with the NMI handler one way or another...

From dmesg

[ 0.762654] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

Note that the upgrade ran fine without error or crash.
 
So the issue seems related to the kernel parameter intremap=no_x2apic_optout, which cannot be carried over to 6.14.8-2-pve and 6.16.0-50-pve without the passthrough crashing the NIC when the VF is attached at VM start. This kernel parameter is required with my motherboard, so it is specific to my system.

For the moment I have to pin the kernel to 6.8.12-13-pve in order to have the VM run without issue.
 
I confirm a drastic change in how VF passthrough is handled by the 6.14.8-2-pve and 6.16.0-50-pve kernels. I have run many tests changing kernel parameters, but it all comes down to this: everything runs fine with the intremap=no_x2apic_optout parameter on the 6.8.12-13-pve kernel, and the NIC crashes on the newer kernels whatever kernel parameters I try. Interestingly, the server manages to fall back to a working state with the VM properly started, but only after many minutes. I see no error messages other than those I posted above. I run recent firmware:

Device #1:
----------

Device Type: ConnectX4LX
Part Number: MCX4121A-ACA_Ax
Description: ConnectX-4 Lx EN network interface card; 25GbE dual-port SFP28; PCIe3.0 x8; ROHS R6
PSID: MT_2420110034
PCI Device Name: 04:00.0
Base MAC: ec0d9ac01294
Versions: Current Available
FW 14.32.1010 14.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.25.0017 14.25.0017

Status: Up to date
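(The query output above looks like what the Mellanox/NVIDIA MFT firmware tool prints; assuming that is what was used, the same check on another host would be something like:)

Code:
mlxfwmanager --query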

So for the moment I am stuck on kernel 6.8.12-13-pve.