VM crashes host cluster network when using virtual function since kernel 6.14

Lefuneste

I am struggling to find a clue about what is going on with my configuration. I already posted about this issue but got no feedback.

The server is part of a running cluster of 3 nodes. It has been rock solid for many years through multiple updates. Since upgrading past kernel 6.8 (i.e. to 6.14, 6.16, or 6.17), every time one specific QEMU VM is started, the host becomes unable to communicate with the cluster for several minutes. It then comes back online without any specific action. I can reproduce this behavior on any kernel version newer than 6.8.12-13-pve.
The problem seems related to the SR-IOV virtual function passed through to the VM. The NIC is a Mellanox ConnectX-4 Lx (MCX4121A-ACAT), with both ports expanded to 5 VFs each.
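
For reference, the VFs are created through the standard sysfs SR-IOV interface; a minimal sketch, assuming the PF interface names ensx0/ensx1 from the network config below (the card firmware must also have SR-IOV enabled, e.g. via mlxconfig, but that is out of scope here):

echo 5 > /sys/class/net/ensx0/device/sriov_numvfs
echo 5 > /sys/class/net/ensx1/device/sriov_numvfs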

Here is the syslog of the host at VM start:

Nov 25 18:34:11 pvf pvedaemon[46943]: start VM 301: UPID:pvf:0000B75F:0003F74F:6925E893:qmstart:301:root@pam:
Nov 25 18:34:11 pvf pvedaemon[7877]: <root@pam> starting task UPID:pvf:0000B75F:0003F74F:6925E893:qmstart:301:root@pam:
Nov 25 18:34:11 pvf kernel: vfio-pci 0000:04:00.7: resetting
Nov 25 18:34:11 pvf kernel: vfio-pci 0000:04:00.7: reset done
Nov 25 18:34:11 pvf kernel: pcieport 0000:00:1c.6: Enabling MPC IRBNCE
Nov 25 18:34:11 pvf kernel: pcieport 0000:00:1c.6: Intel PCH root port ACS workaround enabled
Nov 25 18:34:11 pvf kernel: vfio-pci 0000:0a:00.0: resetting
Nov 25 18:34:11 pvf kernel: vfio-pci 0000:0a:00.0: reset done
Nov 25 18:34:11 pvf systemd[1]: Started 301.scope.
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:04:00.7: enabling device (0000 -> 0002)
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:04:00.7: resetting
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:04:00.7: reset done
Nov 25 18:34:13 pvf kernel: pcieport 0000:00:1c.6: Enabling MPC IRBNCE
Nov 25 18:34:13 pvf kernel: pcieport 0000:00:1c.6: Intel PCH root port ACS workaround enabled
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:0a:00.0: resetting
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:0a:00.0: reset done
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:0a:00.0: resetting
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:0a:00.0: reset done
Nov 25 18:34:13 pvf kernel: vfio-pci 0000:04:00.7: resetting
Nov 25 18:34:14 pvf kernel: vfio-pci 0000:04:00.7: reset done
Nov 25 18:34:14 pvf pvedaemon[46943]: VM 301 started with PID 46960.
Nov 25 18:34:14 pvf pvedaemon[7877]: <root@pam> end task UPID:pvf:0000B75F:0003F74F:6925E893:qmstart:301:root@pam: OK
Nov 25 18:34:15 pvf chronyd[46343]: Selected source 192.168.0.247
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] link: host: 1 link: 0 is down
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] link: host: 1 link: 1 is down
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] host: host: 1 has no active links
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] host: host: 1 has no active links
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] link: host: 3 link: 0 is down
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 25 18:34:25 pvf corosync[7723]: [KNET ] host: host: 3 has no active links
Nov 25 18:34:26 pvf corosync[7723]: [TOTEM ] Token has not been received in 2737 ms
Nov 25 18:34:27 pvf corosync[7723]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Nov 25 18:34:28 pvf corosync[7723]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
Nov 25 18:34:38 pvf corosync[7723]: [TOTEM ] Token has not been received in 15385 ms
Nov 25 18:34:40 pvf pve-firewall[7797]: firewall update time (16.150 seconds)
Nov 25 18:34:46 pvf corosync[7723]: [TOTEM ] Token has not been received in 23203 ms
Nov 25 18:34:47 pvf pve-firewall[7797]: firewall update time (7.370 seconds)
Nov 25 18:34:54 pvf corosync[7723]: [TOTEM ] Token has not been received in 31233 ms
Nov 25 18:35:01 pvf CRON[47183]: pam_unix(cron:session): session opened for user root(uid=0) by root(uid=0)
Nov 25 18:35:01 pvf CRON[47185]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 25 18:35:01 pvf CRON[47183]: pam_unix(cron:session): session closed for user root
Nov 25 18:35:02 pvf corosync[7723]: [TOTEM ] Token has not been received in 39263 ms
Nov 25 18:35:04 pvf corosync[7723]: [QUORUM] Sync members[1]: 2
Nov 25 18:35:04 pvf corosync[7723]: [QUORUM] Sync left[2]: 1 3
Nov 25 18:35:04 pvf corosync[7723]: [TOTEM ] A new membership (2.45ca) was formed. Members left: 1 3
Nov 25 18:35:04 pvf corosync[7723]: [TOTEM ] Failed to receive the leave message. failed: 1 3
Nov 25 18:35:04 pvf pmxcfs[7521]: [dcdb] notice: members: 2/7521
Nov 25 18:35:04 pvf pmxcfs[7521]: [status] notice: members: 2/7521
Nov 25 18:35:04 pvf corosync[7723]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 25 18:35:04 pvf corosync[7723]: [QUORUM] Members[1]: 2
Nov 25 18:35:04 pvf corosync[7723]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 25 18:35:04 pvf pmxcfs[7521]: [status] notice: node lost quorum
Nov 25 18:35:04 pvf pmxcfs[7521]: [dcdb] crit: received write while not quorate - trigger resync
Nov 25 18:35:04 pvf pmxcfs[7521]: [dcdb] crit: leaving CPG group
Nov 25 18:35:05 pvf corosync[7723]: [KNET ] link: host: 3 link: 0 is down
Nov 25 18:35:05 pvf corosync[7723]: [KNET ] pmtud: Global data MTU changed to: 1397
Nov 25 18:35:05 pvf corosync[7723]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 25 18:35:05 pvf corosync[7723]: [KNET ] host: host: 3 has no active links
Nov 25 18:35:05 pvf corosync[7723]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Nov 25 18:35:05 pvf corosync[7723]: [KNET ] host: host: 3 has no active links
Nov 25 18:35:05 pvf pmxcfs[7521]: [dcdb] notice: start cluster connection
Nov 25 18:35:05 pvf pmxcfs[7521]: [dcdb] crit: cpg_join failed: CS_ERR_EXIST
Nov 25 18:35:05 pvf pve-ha-lrm[8688]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pvf/lrm_status.tmp.8688' - Permission denied
Nov 25 18:35:05 pvf pmxcfs[7521]: [dcdb] crit: can't initialize service
Nov 25 18:35:05 pvf pve-ha-crm[8143]: loop take too long (42 seconds)
Nov 25 18:35:05 pvf pve-ha-crm[8143]: status change slave => wait_for_quorum
Nov 25 18:35:06 pvf corosync[7723]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
Nov 25 18:35:09 pvf corosync[7723]: [TOTEM ] Token has not been received in 3257 ms
Nov 25 18:35:09 pvf corosync[7723]: [TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
Nov 25 18:35:10 pvf pve-ha-lrm[8688]: loop take too long (47 seconds)
Nov 25 18:35:10 pvf pvescheduler[47181]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Nov 25 18:35:10 pvf pvescheduler[47182]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Nov 25 18:35:14 pvf pvestatd[7796]: status update time (55.831 seconds)
Nov 25 18:35:16 pvf corosync[7723]: [TOTEM ] Token has not been received in 11026 ms
Nov 25 18:35:24 pvf corosync[7723]: [TOTEM ] Token has not been received in 19056 ms
Nov 25 18:35:32 pvf corosync[7723]: [TOTEM ] Token has not been received in 27086 ms
Nov 25 18:35:36 pvf pve-firewall[7797]: firewall update time (25.732 seconds)
Nov 25 18:35:40 pvf corosync[7723]: [TOTEM ] Token has not been received in 35161 ms
Nov 25 18:35:42 pvf corosync[7723]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
I then lose the host's status in the Proxmox panel (the server icon turns red), but I can still access it via SSH and still get some status information in the panel.
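
While the node is in this state but still reachable over SSH, the cluster view can be checked with the usual tools, for reference:

pvecm status
corosync-cfgtool -s
journalctl -u corosync --since "10 minutes ago"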

Here is the VM config file:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0
cores: 4
cpu: host
efidisk0: local-zfs:vm-301-disk-2,efitype=4m,size=1M
hostpci0: 0000:00:11.4,pcie=1
hostpci1: 0000:04:00.7,pcie=1
hostpci2: 0000:0a:00,pcie=1
machine: q35
memory: 16384
meta: creation-qemu=8.0.2,ctime=1695036555
name: PHACO
numa: 0
ostype: other
protection: 1
scsi0: local-zfs:vm-301-disk-1,discard=on,iothread=1,size=60G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=4ffcda14-0f53-4abc-9430-9612c559e0b6
sockets: 1
tags: truenas
vmgenid: ec353e94-5d82-4d49-9e83-b5f5152b8462
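
For reference, the driver binding and IOMMU group of the passed-through VF can be inspected on the host with:

lspci -nnk -s 0000:04:00.7
find /sys/kernel/iommu_groups/ -name '0000:04:00.7'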

The network configuration of the host is:

source /etc/network/interfaces.d/*

auto lo
iface lo inet loopback

auto enge0
iface enge0 inet static
address 10.168.0.9/24
#I210 Gigabit Network Connection

iface enge0 inet6 static
address 2a01:e0a:aa6:3a84:10:168:0:9/64

auto enge1
iface enge1 inet manual
#I217 Gigabit Network Connection

auto ensx0
iface ensx0 inet manual
ovs_type OVSPort
ovs_bridge vmbr0
ovs_options trunks=0,1,80,107,168 tag=1 vlan_mode=native-untagged
#MT27520 Family [ConnectX-3 Pro] A

auto ensx1
iface ensx1 inet manual
ovs_type OVSPort
ovs_bridge vmbr1
ovs_options trunks=0,1,80,107,168 tag=1 vlan_mode=native-untagged
#MT27520 Family [ConnectX-3 Pro] B

iface ens6f0v1 inet manual

iface ens6f0v2 inet manual

iface ens6f0v3 inet manual

iface ens6f0v4 inet manual

iface ens6f1v0 inet manual

iface ens6f1v1 inet manual

iface ens6f1v2 inet manual

iface ens6f1v3 inet manual

iface ens6f1v4 inet manual

iface ens6f0v0 inet manual

auto vlan1
iface vlan1 inet static
address 192.168.0.9/24
gateway 192.168.0.50
ovs_type OVSIntPort
ovs_bridge vmbr0
ovs_options tag=1

iface vlan1 inet6 static
address 2a01:e0a:aa6:3a81:192:168:0:9/64
gateway 2a01:e0a:aa6:3a81:192:168:0:50

auto vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
ovs_ports ensx0 vlan1

auto vmbr1
iface vmbr1 inet manual
ovs_type OVSBridge
ovs_ports ensx1
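
As a sanity check, the resulting bridge/port layout can be confirmed with ovs-vsctl show, which should list ensx0 and vlan1 under vmbr0, and ensx1 under vmbr1:

ovs-vsctl show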

Here is the ip link enumeration once the server recovers from the hanging state. Note that while the server is in limbo this command hangs, and that the VM has taken the VF ens6f1v0, so it is no longer listed:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 90:1b:0e:d4:d5:23 brd ff:ff:ff:ff:ff:ff
5: enge1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 90:1b:0e:a9:a6:3a brd ff:ff:ff:ff:ff:ff
6: ensx0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:94 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
7: ensx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:95 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state enable, trust off, query_rss off
9: ens6f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:96 brd ff:ff:ff:ff:ff:ff
altname enp4s0f0v0
10: ens6f1v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:9c brd ff:ff:ff:ff:ff:ff
altname enp4s0f1v1
11: ens6f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:97 brd ff:ff:ff:ff:ff:ff
altname enp4s0f0v1
12: ens6f1v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:9d brd ff:ff:ff:ff:ff:ff
altname enp4s0f1v2
13: ens6f0v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:98 brd ff:ff:ff:ff:ff:ff
altname enp4s0f0v2
14: ens6f1v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:9e brd ff:ff:ff:ff:ff:ff
altname enp4s0f1v3
15: ens6f0v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:99 brd ff:ff:ff:ff:ff:ff
altname enp4s0f0v3
16: ens6f1v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:9f brd ff:ff:ff:ff:ff:ff
altname enp4s0f1v4
17: ens6f0v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:9a brd ff:ff:ff:ff:ff:ff
altname enp4s0f0v4
18: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 06:a3:45:b2:ea:b8 brd ff:ff:ff:ff:ff:ff
19: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:94 brd ff:ff:ff:ff:ff:ff
20: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether ec:0d:9a:c0:12:95 brd ff:ff:ff:ff:ff:ff
21: vlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether a6:7f:cc:d4:d7:99 brd ff:ff:ff:ff:ff:ff
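
While the VM is running, the driver that currently owns the VF can be checked directly in sysfs; it should point at vfio-pci while the device is passed through:

readlink /sys/bus/pci/devices/0000:04:00.7/driver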

I have narrowed the issue down, to some extent, to the boot parameter intremap=no_x2apic_optout:

cat /etc/kernel/cmdline
root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt intremap=no_x2apic_optout pci_pt_e820_access=on quiet net.naming-scheme=v252
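
For anyone wanting to reproduce: on a ZFS-boot install like this one, /etc/kernel/cmdline is re-applied after editing with proxmox-boot-tool, and the last known-good kernel can be pinned the same way:

proxmox-boot-tool refresh
proxmox-boot-tool kernel pin 6.8.12-13-pve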
 