Hello,
I have a 2+1 node setup (two Proxmox nodes plus a QDevice on a Raspberry Pi), and things were working quite well until one of the nodes went offline. While that node is out for repair, the cluster has been running with one node and the QDevice. I have not changed the cluster configuration, the number of votes, etc. to compensate for the node being down. Everything is on the LAN.
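For reference, these are the standard commands for inspecting the vote/quorum state on the remaining node (nothing is customised on my side; exact output omitted here):
Code:
# cluster membership, expected votes and whether the node is quorate
pvecm status

# corosync's own view of quorum, including the QDevice votes
corosync-quorumtool -s

# per-node votes and the quorum/device section of the cluster config
cat /etc/pve/corosync.conf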
During this 1+1 situation, I have noticed that whenever the Raspberry Pi restarts (or is slow to respond), the node also restarts. I can reproduce this consistently.
I watched the logs, and it does not look like a clean reboot. Here are the logs (from the last boot):
Code:
Jul 09 15:59:42 Sparrow corosync[2307]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jul 09 15:59:42 Sparrow corosync[2307]: [QUORUM] Members[1]: 1
Jul 09 15:59:42 Sparrow pmxcfs[1761]: [status] notice: node lost quorum
Jul 09 15:59:45 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 15:59:49 Sparrow pve-ha-crm[2381]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Jul 09 15:59:51 Sparrow pve-ha-lrm[2400]: lost lock 'ha_agent_Sparrow_lock - cfs lock update failed - Permission denied
Jul 09 15:59:53 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: status change master => lost_manager_lock
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: watchdog closed (disabled)
Jul 09 15:59:54 Sparrow pve-ha-crm[2381]: status change lost_manager_lock => wait_for_quorum
Jul 09 15:59:56 Sparrow pve-ha-lrm[2400]: status change active => lost_agent_lock
Jul 09 16:00:01 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:09 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:10 Sparrow pvescheduler[17650]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jul 09 16:00:10 Sparrow pvescheduler[17649]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jul 09 16:00:17 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:25 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:27 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:33 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:36 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:41 Sparrow corosync-qdevice[2345]: Connect timeout
Jul 09 16:00:42 Sparrow watchdog-mux[1089]: client watchdog expired - disable watchdog updates
Jul 09 16:00:44 Sparrow corosync-qdevice[2345]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jul 09 16:00:49 Sparrow corosync-qdevice[2345]: Connect timeout
Notice how the logs end abruptly. I am watching them over a remote session, but even with
journalctl -b -1
I do not see anything more.

Thank you for looking into it.
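P.S. In case it is relevant: if the abrupt end is simply journald batching writes (non-critical messages are only flushed to disk periodically by default), making the journal persistent and flushing it more often might preserve more of the tail after a hard watchdog reset. A rough sketch, assuming the stock journald.conf:
Code:
# check whether the journal is already stored on disk (directory usually exists on Debian/PVE)
ls -d /var/log/journal

# in /etc/systemd/journald.conf, force persistence and a shorter flush interval, e.g.:
#   Storage=persistent
#   SyncIntervalSec=30s
systemctl restart systemd-journald

# after the next occurrence, earlier boots can be listed and read with:
journalctl --list-boots
journalctl -b -1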