Hello,
I have a 2+1 node setup (two Proxmox nodes plus a QDevice on a Raspberry Pi), and things were working quite well until one of the nodes went offline. While that node is out for repair, the cluster has been running with one node and the QDevice. I have not changed the cluster configuration, the number of votes, etc. to compensate for the node being down. Everything is on the LAN.
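For reference, these are the standard commands for inspecting the vote/quorum state on the remaining node (nothing is customised on my side; exact output omitted here):
Code:
# cluster membership, expected votes and whether the node is quorate
pvecm status

# corosync's own view of quorum, including the QDevice votes
corosync-quorumtool -s

# per-node votes and the quorum/device section of the cluster config
cat /etc/pve/corosync.conf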
During this 1+1 situation, I have noticed that whenever the Raspberry Pi restarts (or is slow to respond), the node also restarts. I can reproduce this consistently.
I watched the logs, and it does not look like a clean reboot. Here are the logs (from the last boot):
Code:
Jul 09 15:59:42 Sparrow corosync[2307]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 09 15:59:42 Sparrow corosync[2307]: [QUORUM] Members[1]: 1
Jul 09 15:59:42 Sparrow pmxcfs[1761]: [status] notice: node lost quorum
Jul 09 15:59:45 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 15:59:49 Sparrow pve-ha-crm[2381]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Jul 09 15:59:51 Sparrow pve-ha-lrm[2400]: lost lock 'ha_agent_Sparrow_lock - cfs lock update failed - Permission denied
Jul 09 15:59:53 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: status change master => lost_manager_lock
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: watchdog closed (disabled)
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: status change lost_manager_lock => wait_for_quorum
Jul 09 15:59:56 Sparrow pve-ha-lrm[2400]: status change active => lost_agent_lock
Jul 09 16:00:01 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:09 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:10 Sparrow pvescheduler[17650]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 09 16:00:10 Sparrow pvescheduler[17649]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 09 16:00:17 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:25 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:27 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:33 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:36 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:41 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:42 Sparrow watchdog-mux[1089]: client watchdog expired - disable watchdog updates
Jul 09 16:00:44 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:49 Sparrow corosync-qdevice[2345]: Connect timeout
Notice how the logs end abruptly. I am watching them over a remote session, but even with
journalctl -b -1
I do not see anything more.

Thank you for looking into it.
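P.S. In case it is relevant: if the abrupt end is simply journald batching writes (non-critical messages are only flushed to disk periodically by default), making the journal persistent and flushing it more often might preserve more of the tail after a hard watchdog reset. A rough sketch, assuming the stock journald.conf:
Code:
# check whether the journal is already stored on disk (directory usually exists on Debian/PVE)
ls -d /var/log/journal

# in /etc/systemd/journald.conf, force persistence and a shorter flush interval, e.g.:
#   Storage=persistent
#   SyncIntervalSec=30s
systemctl restart systemd-journald

# after the next occurrence, earlier boots can be listed and read with:
journalctl --list-boots
journalctl -b -1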