Hello all,
A few weeks ago I upgraded my Proxmox homelab setup to include a cluster. I have a primary node that stays online 24/7, a secondary node that is turned on only when reboots are needed or in case of issues with the primary node, and a Docker container deployed on my NAS so that my cluster can maintain quorum when the secondary node is offline, as it usually is. It took a while to get the qdevice/qnetd setup working, mainly due to user error, but I did finally get it working. Both nodes recognized the vote from the Qdevice and quorum was always maintained when a node was down. I know for a fact that my configuration is correct because this setup worked correctly for weeks and survived multiple reboots of both the primary and secondary node as well as the NAS.
Today, after updating my nodes and rebooting my primary node, the corosync-qdevice service failed on startup with the result "error-code". The errors shown by systemctl status corosync-qdevice.service all pertain to the start requests repeating too quickly. After checking journalctl -xe, I am seeing the following error when the service first attempts to restart:
corosync-qdevice[92636]: Can't init nss (-8174): security library: bad database.
This is occurring only on the primary node. The secondary node is having no issues at all with the cluster, and neither is the qnetd server (no errors in any logs, and pvecm status run on the secondary node shows all 3 members having a vote). I had issues like this while setting up the qdevice/qnetd system, but at the time I couldn't tell what was user error and what was not, and some arcane combination of uninstalling and deleting everything qdevice/qnetd related followed by reinstalling everything managed to fix it. This time I want to actually understand what this error means and what can be done to resolve it rather than blow everything away and start over. There is clearly something wrong with the primary node specifically, and if I just uninstall/reinstall everything again I could randomly wind up back in the same situation I am in now, with a broken cluster.
Here is what I can say for certain:
- My setup does not have any configuration issues (firewalls, ssh/sshd). I had no issues with the 3-member cluster for weeks and I have not changed any of their configurations.
- Both nodes are fully up to date and using the no-subscription repository.
- I have run pvecm updatecerts on both nodes, with no effect.
- I have checked /etc/pve/corosync.conf on both nodes; they are identical.
- pvecm status shows that the Qdevice has a vote when run on the secondary node, but not on the primary node.
- If I run systemctl reset-failed corosync-qdevice.service and then systemctl restart corosync-qdevice.service, it fails again immediately with the same error message.
- Googling the above error message reveals essentially nothing other than a few other Proxmox forum posts where others got the same error. Nobody seems to have dug into what exactly the error message means; they all resolved it through other means (restarting nodes, running pvecm updatecerts, uninstalling/reinstalling everything qdevice/qnetd related, and one person copied their entire /etc/corosync path from the node that worked correctly to the node that didn't, which to me seems heavy-handed and might cause other issues).
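To make it easier to compare the two nodes, the checks listed above could be bundled into something like the following. This is only a rough sketch, not a fix. The function names are made up for illustration. Since the message mentions NSS and a "bad database", the sketch also looks at the NSS certificate database; the path /etc/corosync/qdevice/net/nssdb and the use of certutil (from libnss3-tools on Debian) are my assumptions about where corosync-qdevice keeps its certificates, so verify them on your own node.

```shell
#!/usr/bin/env bash
# Sketch only: gathers the diagnostics mentioned above into one report.
# ASSUMPTION: this is the usual corosync-qdevice NSS database location;
# override NSSDB_DIR if your node stores it elsewhere.
NSSDB_DIR="${NSSDB_DIR:-/etc/corosync/qdevice/net/nssdb}"

# Run a command if it is installed; never abort the report on failure.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "### $*"
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "### $1: not installed, skipped"
    fi
}

# Report whether the NSS database directory exists and what it contains.
# Returns 1 if the directory is missing or empty.
describe_nssdb() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ -z "$(ls -A "$dir")" ]; then
        echo "empty: $dir"
        return 1
    fi
    echo "present: $dir"
    ls -l "$dir"
}

# Hypothetical wrapper: run every check in one go on the current node.
collect_qdevice_diag() {
    run_if_present systemctl status corosync-qdevice.service --no-pager
    run_if_present journalctl -u corosync-qdevice.service -b --no-pager -n 50
    run_if_present pvecm status
    describe_nssdb "$NSSDB_DIR"
    # certutil -L lists the certificates stored in the database given by -d.
    run_if_present certutil -L -d "$NSSDB_DIR"
}

# Example (run as root on each node, then diff the two files):
#   collect_qdevice_diag > "qdevice-diag-$(hostname).txt"
```

Running it on both the primary and the secondary node and diffing the two reports should at least show whether the database directory or its contents differ between the node that works and the one that does not.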