[SOLVED] Node reboots unexpectedly when quorum is lost

explorer

Hello,

I have a 2+1 node setup (2 Proxmox nodes + 1 QDevice on a Raspberry Pi), and things were working quite well until one of the nodes went offline. Since that node is being repaired, the cluster is running with 1 node and the QDevice. I have not edited the cluster, the number of votes, etc. to compensate for one node being down. Everything is on the LAN.

During this 1+1 situation, I have noticed that whenever the Raspberry Pi restarts (or is slow to respond), the node also restarts. I can reproduce this consistently.

I watched the logs, and it doesn't look like this is a safe reboot. Here are the logs (from the last boot):
Code:
Jul 09 15:59:42 Sparrow corosync[2307]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 09 15:59:42 Sparrow corosync[2307]:   [QUORUM] Members[1]: 1
Jul 09 15:59:42 Sparrow pmxcfs[1761]: [status] notice: node lost quorum
Jul 09 15:59:45 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 15:59:49 Sparrow pve-ha-crm[2381]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Jul 09 15:59:51 Sparrow pve-ha-lrm[2400]: lost lock 'ha_agent_Sparrow_lock - cfs lock update failed - Permission denied
Jul 09 15:59:53 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: status change master => lost_manager_lock
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: watchdog closed (disabled)
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: status change lost_manager_lock => wait_for_quorum
Jul 09 15:59:56 Sparrow pve-ha-lrm[2400]: status change active => lost_agent_lock
Jul 09 16:00:01 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:09 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:10 Sparrow pvescheduler[17650]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 09 16:00:10 Sparrow pvescheduler[17649]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 09 16:00:17 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:25 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:27 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:33 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:36 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:41 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:42 Sparrow watchdog-mux[1089]: client watchdog expired - disable watchdog updates
Jul 09 16:00:44 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:49 Sparrow corosync-qdevice[2345]: Connect timeout

Notice how the logs end abruptly. I am watching the logs over a remote session, but even if I use journalctl -b -1, I do not see anything more.

Thank you for looking into it.
 
You do have HA guests, right? Then this behavior is expected. You can set all HA guests to "Ignored" until the other node is back.

After you have done so, the LRM status on the node (Datacenter -> HA) should switch from "active" back to "idle". Once in idle state, the node won't fence (hard reset) itself anymore if the Quorum is lost.
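For the command line, a minimal sketch of the same step, assuming the `ha-manager set <sid> --state ignored` form of the CLI. The function name `ha_ignore` and the resource IDs `vm:100` and `ct:101` are placeholders for illustration, not values from this thread. By default the sketch only prints the commands it would run.

```shell
# Mark each given HA resource as "ignored" so the HA stack stops managing it.
# The SIDs passed in (e.g. vm:100) are placeholders; use your own resource IDs.
# DRY_RUN defaults to 1, which only prints the commands instead of running them.
ha_ignore() {
    for sid in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ha-manager set $sid --state ignored"
        else
            ha-manager set "$sid" --state ignored
        fi
    done
}

# Example (dry run): ha_ignore vm:100 ct:101
```

Running it with `DRY_RUN=0` would actually apply the change; afterwards the LRM status can be checked in the GUI as described above, or with `ha-manager status`.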
 
Works as advertised. Your cluster has 3 votes.
With one node away there are 2 votes left, and 2 out of 3 is a majority.
If your Raspberry Pi goes away too, there is only 1 vote left, which is a minority, so the remaining node loses quorum and quits.
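The vote arithmetic above can be sketched as a small shell helper, assuming the usual majority rule (more than half of the expected votes). The function names `quorum_needed` and `has_quorum` are made up for illustration.

```shell
# Votes required for quorum: a strict majority of the expected votes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if the votes currently present ($2) reach quorum
# for the given number of expected votes ($1).
has_quorum() {
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

# 2 nodes + 1 QDevice = 3 expected votes
has_quorum 3 2 && echo "1 node + QDevice: quorate"
has_quorum 3 1 || echo "1 node alone: not quorate"
```

On a real cluster, `pvecm status` reports the expected votes, the total votes and the quorum value, so the numbers do not need to be computed by hand.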
 
Hi, follow up question:
Why is there no log entry indicating that the node was restarted by Proxmox? That would help a lot with diagnosis.
 
Code:
journalctl -u watchdog-mux

And you will find at some point: client watchdog expired - disable watchdog updates

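And it goes from there: once the watchdog is no longer updated, the node gets hard reset shortly afterwards, and a hard reset leaves no chance to write anything more to the log.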
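To see how long the node held on after losing quorum, the two relevant messages can be pulled out of the journal. This is only a rough sketch: `fence_delay` is a made-up helper name, it assumes the default journalctl line format shown in the first post (time in the third field), and it assumes both messages fall on the same day.

```shell
# Read journal lines on stdin and print the number of seconds between the
# first "node lost quorum" message and the following
# "client watchdog expired" message.
fence_delay() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        /node lost quorum/ && !seen { lost = secs($3); seen = 1 }
        /client watchdog expired/ && seen { print secs($3) - lost; exit }
    '
}

# Example: journalctl -b -1 | fence_delay
```

Fed with the log from the first post, this prints 60: quorum was lost at 15:59:42 and the client watchdog expired at 16:00:42.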
 
