We recently set up a Proxmox cluster for the first time. There are two nodes and a qdevice. Everything was working great until I tested enabling HA on a VM. A minute or so later, both Proxmox nodes ended up at a purple kernel crash screen saying there had been an NMI. The server SP reported that the watchdog had fired on each node (I thought the watchdog forced a reboot, though, as opposed to a kernel crash screen).
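For what it's worth, I still need to confirm which watchdog is actually in play here; my understanding is that something like the following should show it (the WATCHDOG_MODULE setting in /etc/default/pve-ha-manager is only present if a hardware watchdog was explicitly configured, otherwise Proxmox falls back to softdog):
# lsmod | grep -e softdog -e ipmi_watchdog -e iTCO_wdt
# cat /etc/default/pve-ha-manager
# journalctl -k -b -1 | grep -i -e watchdog -e nmi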
I hard reset the nodes, they came up, and immediately purple screened again. I reset them again, and this time I was able to get into the web UI and disable HA on that VM, and the crashing stopped. Looking through the logs, I'm still not clear on what happened. One node shows:
Sep 03 21:56:17 corosync[1580]: [KNET ] link: host: 2 link: 0 is down
Sep 03 21:56:17 corosync[1580]: [KNET ] link: host: 2 link: 1 is down
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 has no active links
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 has no active links
Sep 03 21:56:18 corosync[1580]: [TOTEM ] Token has not been received in 2250 ms
Sep 03 21:56:19 corosync[1580]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensu>
Sep 03 21:56:23 corosync[1580]: [QUORUM] Sync members[1]: 1
Sep 03 21:56:23 corosync[1580]: [QUORUM] Sync left[1]: 2
Sep 03 21:56:23 corosync[1580]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 03 21:56:23 corosync[1580]: [TOTEM ] A new membership (1.41) was formed. Members left: 2
Sep 03 21:56:23 corosync[1580]: [TOTEM ] Failed to receive the leave message. failed: 2
Sep 03 21:56:23 pmxcfs[1433]: [dcdb] notice: members: 1/1433
Sep 03 21:56:23 pmxcfs[1433]: [status] notice: members: 1/1433
Sep 03 21:56:23 corosync[1580]: [QUORUM] Members[1]: 1
Sep 03 21:56:23 corosync[1580]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 03 21:56:23 pve-ha-crm[1651]: status change wait_for_quorum => slave
-- Boot f86039067cd149e8a29471c317c0d802 --
I think this is the node that crashed second, since it logged the other node going away. The node I think crashed first shows only:
Sep 03 21:54:51 pvedaemon[1641]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Sep 03 21:55:17 pvedaemon[1643]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
-- Boot d770df1e75e14c32833b4e3cab5ac9f0 --
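Side note: I believe those guest-ping timeouts just mean the QEMU guest agent in VM 100 had stopped answering, which may or may not be related to the crash. If I have the tooling right, the same check can be run by hand with:
# qm guest cmd 100 ping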
The second time around, the first node said:
Sep 03 22:04:36 corosync[1581]: [KNET ] link: host: 2 link: 0 is down
Sep 03 22:04:36 corosync[1581]: [KNET ] link: host: 2 link: 1 is down
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 has no active links
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 has no active links
Sep 03 22:04:37 corosync[1581]: [TOTEM ] Token has not been received in 2250 ms
Sep 03 22:04:38 corosync[1581]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensu>
Sep 03 22:04:42 corosync[1581]: [QUORUM] Sync members[1]: 1
Sep 03 22:04:42 corosync[1581]: [QUORUM] Sync left[1]: 2
Sep 03 22:04:42 corosync[1581]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 03 22:04:42 corosync[1581]: [TOTEM ] A new membership (1.4e) was formed. Members left: 2
Sep 03 22:04:42 corosync[1581]: [TOTEM ] Failed to receive the leave message. failed: 2
Sep 03 22:04:42 pmxcfs[1434]: [dcdb] notice: members: 1/1434
Sep 03 22:04:42 pmxcfs[1434]: [status] notice: members: 1/1434
Sep 03 22:04:43 corosync[1581]: [QUORUM] Members[1]: 1
Sep 03 22:04:43 corosync[1581]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 03 22:04:43 pve-ha-crm[1649]: node 'pmx02': state changed from 'online' => 'unknown'
Sep 03 22:05:33 pve-ha-crm[1649]: service 'vm:100': state changed from 'started' to 'fence'
Sep 03 22:05:33 pve-ha-crm[1649]: node 'pmx02': state changed from 'unknown' => 'fence'
Sep 03 22:05:33 perl[1649]: notified via target `mail-to-root`
-- Boot 45942e8b240446399f93207ad66036bd --
This time the first node sent an email saying it was fencing the other node; the fenced node's log shows:
Sep 03 22:04:26 pve-ha-lrm[2074]: VM 100 started with PID 2086.
Sep 03 22:04:26 pve-ha-lrm[2073]: <root@pam> end task UPID:pmx02:0000081A:00002E57:68B901B9:qmstart:100:root@pam: OK
Sep 03 22:04:26 pve-ha-lrm[2073]: service status vm:100 started
-- Boot 3dbfcfb63ab2487a8f037c8625b71865 --
At this point, the ha-manager seemed stuck trying to disable HA on the VM:
# ha-manager status
quorum OK
master pmx01 (idle, Wed Sep 3 22:06:43 2025)
lrm pmx01 (idle, Wed Sep 3 22:55:11 2025)
lrm pmx02 (wait_for_agent_lock, Wed Sep 3 22:55:11 2025)
service vm:100 (pmx02, deleting)
Based on advice from a forum post, I stopped the pve-ha-crm service on both nodes, removed the /etc/pve/ha/manager_status file, and restarted the service, and it seems ok now:
# ha-manager status
quorum OK
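For reference, the sequence was roughly this (pve-ha-crm stopped on both nodes before touching the file; since /etc/pve is the shared cluster filesystem, removing the file once was enough):
# systemctl stop pve-ha-crm
# rm /etc/pve/ha/manager_status
# systemctl start pve-ha-crm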
When I went to test re-enabling HA on the VM, I noticed the dialog says "At least three quorum votes are recommended for reliable HA". But I have two nodes and a qdevice, which should be three votes:
# pvecm status
Cluster information
-------------------
Name: XXX
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Sep 3 23:00:42 2025
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.57
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 A,V,NMW 172.16.10.1 (local)
0x00000002 1 A,V,NMW 172.16.10.2
0x00000000 1 Qdevice
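If I'm reading the votequorum math right, that looks healthy: 3 total votes with a quorum of 2, so a single surviving node plus the qdevice should stay quorate. I haven't tried these yet, but I assume the qdevice daemons themselves can be checked with something like:
# corosync-qdevice-tool -sv (on each node)
# corosync-qnetd-tool -l (on the qdevice host)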
I see another post with the same situation that was never answered (from PhilC, in the Proxmox VE: Installation and configuration forum, with no replies):

"I have set up a QDevice on a Raspberry Pi; with a 2 node cluster. All seems to be working and all nodes and the QDevice can see each other - see 'PVECM STATUS' output below.... However, when I go to add a VM to HA, I am still getting the warning message - 'At least three quorum votes are recommended for reliable HA'. Is this a problem; I thought this message would be suppressed once I have 3 votes available?"

Is there something wrong with my cluster that's causing these failures? I don't really want to test enabling HA on the VM again until I figure out why everything crashed.
Thanks...