We recently set up a Proxmox cluster for the first time. There are two nodes and a qdevice. Everything was working great until I tested enabling HA on a VM. A minute or so later, both Proxmox nodes ended up at a purple kernel crash screen saying there had been an NMI. The server SP reported that the watchdog had fired on each node (I thought the watchdog forced a reboot, though, as opposed to a kernel crash screen).
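For what it's worth, I still need to confirm which watchdog is actually in play here; my understanding is that something like the following should show it (the WATCHDOG_MODULE setting in /etc/default/pve-ha-manager is only present if a hardware watchdog was explicitly configured, otherwise Proxmox falls back to softdog):
# lsmod | grep -e softdog -e ipmi_watchdog -e iTCO_wdt
# cat /etc/default/pve-ha-manager
# journalctl -k -b -1 | grep -i -e watchdog -e nmi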
I hard reset the nodes, they came up, and immediately purple screened again. I reset them again, and this time I was able to get into the web UI and disable HA on that VM, and the crashing stopped. Looking through the logs, I'm still not clear on what happened. One node shows:
Sep 03 21:56:17 corosync[1580]: [KNET ] link: host: 2 link: 0 is down
Sep 03 21:56:17 corosync[1580]: [KNET ] link: host: 2 link: 1 is down
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 has no active links
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 21:56:17 corosync[1580]: [KNET ] host: host: 2 has no active links
Sep 03 21:56:18 corosync[1580]: [TOTEM ] Token has not been received in 2250 ms
Sep 03 21:56:19 corosync[1580]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensu>
Sep 03 21:56:23 corosync[1580]: [QUORUM] Sync members[1]: 1
Sep 03 21:56:23 corosync[1580]: [QUORUM] Sync left[1]: 2
Sep 03 21:56:23 corosync[1580]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 03 21:56:23 corosync[1580]: [TOTEM ] A new membership (1.41) was formed. Members left: 2
Sep 03 21:56:23 corosync[1580]: [TOTEM ] Failed to receive the leave message. failed: 2
Sep 03 21:56:23 pmxcfs[1433]: [dcdb] notice: members: 1/1433
Sep 03 21:56:23 pmxcfs[1433]: [status] notice: members: 1/1433
Sep 03 21:56:23 corosync[1580]: [QUORUM] Members[1]: 1
Sep 03 21:56:23 corosync[1580]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 03 21:56:23 pve-ha-crm[1651]: status change wait_for_quorum => slave
-- Boot f86039067cd149e8a29471c317c0d802 --
I think this is the node that crashed second, since it logged the other node going away. The node I think crashed first shows only:
Sep 03 21:54:51 pvedaemon[1641]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Sep 03 21:55:17 pvedaemon[1643]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
-- Boot d770df1e75e14c32833b4e3cab5ac9f0 --
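Side note: I believe those guest-ping timeouts just mean the QEMU guest agent in VM 100 had stopped answering, which may or may not be related to the crash. If I have the tooling right, the same check can be run by hand with:
# qm guest cmd 100 ping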
The second time around, the first node said:
Sep 03 22:04:36 corosync[1581]: [KNET ] link: host: 2 link: 0 is down
Sep 03 22:04:36 corosync[1581]: [KNET ] link: host: 2 link: 1 is down
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 has no active links
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Sep 03 22:04:36 corosync[1581]: [KNET ] host: host: 2 has no active links
Sep 03 22:04:37 corosync[1581]: [TOTEM ] Token has not been received in 2250 ms
Sep 03 22:04:38 corosync[1581]: [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensu>
Sep 03 22:04:42 corosync[1581]: [QUORUM] Sync members[1]: 1
Sep 03 22:04:42 corosync[1581]: [QUORUM] Sync left[1]: 2
Sep 03 22:04:42 corosync[1581]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 03 22:04:42 corosync[1581]: [TOTEM ] A new membership (1.4e) was formed. Members left: 2
Sep 03 22:04:42 corosync[1581]: [TOTEM ] Failed to receive the leave message. failed: 2
Sep 03 22:04:42 pmxcfs[1434]: [dcdb] notice: members: 1/1434
Sep 03 22:04:42 pmxcfs[1434]: [status] notice: members: 1/1434
Sep 03 22:04:43 corosync[1581]: [QUORUM] Members[1]: 1
Sep 03 22:04:43 corosync[1581]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 03 22:04:43 pve-ha-crm[1649]: node 'pmx02': state changed from 'online' => 'unknown'
Sep 03 22:05:33 pve-ha-crm[1649]: service 'vm:100': state changed from 'started' to 'fence'
Sep 03 22:05:33 pve-ha-crm[1649]: node 'pmx02': state changed from 'unknown' => 'fence'
Sep 03 22:05:33 perl[1649]: notified via target `mail-to-root`
-- Boot 45942e8b240446399f93207ad66036bd --
This time the first node sent an email saying it was fencing the other node; the fenced node's log shows:
Sep 03 22:04:26 pve-ha-lrm[2074]: VM 100 started with PID 2086.
Sep 03 22:04:26 pve-ha-lrm[2073]: <root@pam> end task UPID:pmx02:0000081A:00002E57:68B901B9:qmstart:100:root@pam: OK
Sep 03 22:04:26 pve-ha-lrm[2073]: service status vm:100 started
-- Boot 3dbfcfb63ab2487a8f037c8625b71865 --
At this point, the ha-manager seemed stuck trying to disable HA on the VM:
# ha-manager status
quorum OK
master pmx01 (idle, Wed Sep 3 22:06:43 2025)
lrm pmx01 (idle, Wed Sep 3 22:55:11 2025)
lrm pmx02 (wait_for_agent_lock, Wed Sep 3 22:55:11 2025)
service vm:100 (pmx02, deleting)
Based on advice from a forum post, I stopped the pve-ha-crm service on both nodes, removed the /etc/pve/ha/manager_status file, and restarted the service, and it seems ok now:
# ha-manager status
quorum OK
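For reference, the sequence was roughly this (pve-ha-crm stopped on both nodes before touching the file; since /etc/pve is the shared cluster filesystem, removing the file once was enough):
# systemctl stop pve-ha-crm
# rm /etc/pve/ha/manager_status
# systemctl start pve-ha-crm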
When I went to test re-enabling HA on the VM, I noticed the dialog says "At least three quorum votes are recommended for reliable HA". But I have two nodes and a qdevice, which should be three votes:
# pvecm status
Cluster information
-------------------
Name: XXX
Config Version: 3
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Wed Sep 3 23:00:42 2025
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.57
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
0x00000001 1 A,V,NMW 172.16.10.1 (local)
0x00000002 1 A,V,NMW 172.16.10.2
0x00000000 1 Qdevice
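If I'm reading the votequorum math right, that looks healthy: 3 total votes with a quorum of 2, so a single surviving node plus the qdevice should stay quorate. I haven't tried these yet, but I assume the qdevice daemons themselves can be checked with something like:
# corosync-qdevice-tool -sv (on each node)
# corosync-qnetd-tool -l (on the qdevice host)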
I see another post with the same situation that was never answered (from PhilC, in the Proxmox VE: Installation and configuration forum, with no replies):

"I have set up a QDevice on a Raspberry Pi; with a 2 node cluster. All seems to be working and all nodes and the QDevice can see each other - see 'PVECM STATUS' output below.... However, when I go to add a VM to HA, I am still getting the warning message - 'At least three quorum votes are recommended for reliable HA'. Is this a problem; I thought this message would be suppressed once I have 3 votes available?"

Is there something wrong with my cluster that's causing these failures? I don't really want to test enabling HA on the VM again until I figure out why everything crashed.
Thanks...