[SOLVED] corosync-qdevice: Can't init nss: security library: bad database

Nov 5, 2021
Hello all,

A few weeks ago I upgraded my Proxmox homelab setup to include a cluster. I have a primary node that stays online 24/7, a secondary node that is turned on only when reboots are needed or when the primary node has issues, and a Docker container deployed on my NAS so that the cluster can maintain quorum while the secondary node is offline, as it usually is. It took a while to get the qdevice/qnetd setup working, mainly due to user error, but I did finally get it working. Both nodes recognized the vote from the Qdevice, and quorum was always maintained when a node was down. I am confident that my configuration is correct because this setup worked for weeks and survived multiple reboots of both the primary and secondary node as well as the NAS.

Today, after updating my nodes and rebooting my primary node, the corosync-qdevice service failed on startup with the result "error-code". The errors shown by systemctl status corosync-qdevice.service all pertain to start requests repeating too quickly. Checking journalctl -xe, I see the following error when the service first attempts to restart:

corosync-qdevice[92636]: Can't init nss (-8174): security library: bad database.

This is occurring only on the primary node. The secondary node is having no issues at all with the cluster, and neither is the qnetd server (no errors in any logs, and pvecm status run on the secondary node shows all 3 members having a vote). I had issues like this while setting up the qdevice/qnetd system, but at the time I wasn't sure what was user error and what was not, and some arcane combination of uninstalling and deleting everything qdevice/qnetd related followed by reinstalling everything managed to fix it. This time I want to understand what this error means and what can be done to resolve it rather than blow everything away and start over. Clearly something is wrong with the primary node specifically, and if I just uninstall/reinstall everything again I could randomly wind up back in the same situation, with a broken cluster.

Here is what I can say for certain:

  1. My setup does not have any configuration issues (firewalls, ssh/sshd). I had no issues with the 3-member cluster for weeks and I have not changed any of their configurations.

  2. Both nodes are fully up to date and using the no-subscription repository.

  3. I have run pvecm updatecerts on both nodes, no effect.

  4. I have checked the /etc/pve/corosync.conf on both nodes, they are identical.

  5. pvecm status shows that the Qdevice has a vote when run on the secondary node, but not on the primary node.

  6. If I run systemctl reset-failed corosync-qdevice.service and then systemctl restart corosync-qdevice.service, it fails again immediately with the same error message.

  7. Googling the above error message reveals essentially nothing other than a few other Proxmox forum posts where others got the same error. Nobody there dug into what exactly the error message means; they all resolved it through other means (restarting nodes, running pvecm updatecerts, uninstalling/reinstalling everything qdevice/qnetd related, and one person copied their entire /etc/corosync path from the node that worked to the node that didn't, which seems heavy-handed to me and might cause other issues).

Does anyone know what this error actually means and how to resolve it specifically? I am happy to provide any other details requested. I would love to get to the bottom of this because it seems I am not the only one having this issue.
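On what the number in parentheses means: NSS defines its security-library error codes as offsets from SEC_ERROR_BASE (-0x2000, i.e. -8192), so -8174 is offset 18, which the NSS headers name SEC_ERROR_BAD_DATABASE. That matches the "security library: bad database" text. A minimal sketch that pulls the code out of a log line and computes the offset (the function names are made up for illustration and are not part of corosync or NSS):

```shell
#!/bin/sh
# Extract the numeric NSS error code from a corosync-qdevice log line.
# Hypothetical helper; prints nothing if the line has no "init nss (...)" part.
nss_error_code() {
    printf '%s\n' "$1" | sed -n 's/.*init nss (\(-[0-9][0-9]*\)).*/\1/p'
}

# NSS security-library codes are SEC_ERROR_BASE + offset,
# where SEC_ERROR_BASE is -0x2000 (-8192).
nss_error_offset() {
    echo $(( $1 + 8192 ))
}

line="corosync-qdevice[92636]: Can't init nss (-8174): security library: bad database."
code=$(nss_error_code "$line")
echo "code=$code offset=$(nss_error_offset "$code")"
```

An offset of 18 only says that NSS could not open a usable certificate/key database at the path it was given. It does not say whether the database is missing, unreadable, or in an unexpected format.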


I figured it out! Replying to myself in case this is useful to anyone.

While reading the Ubuntu manpage for corosync-qdevice, I noticed a section that says this:

Depending on configuration of NSS (stored in nss.config file usually in
/etc/crypto-policies/back-ends/ directory) disabled ciphers or too short keys may be
rejected. Proper solution is to regenerate NSS databases for both corosync-qnetd and
corosync-qdevice daemons. As a quick workaround it's also possible to set environment
variable NSS_IGNORE_SYSTEM_POLICY=1 before running corosync-qdevice daemon.

When NSS is updated it may also be needed to upgrade database into new format. There is no
consensus on recommended way, but following command seems to work just fine (if qdevice
sysconfdir is set to /etc)

# certutil -N -d /etc/corosync/qdevice/net/nssdb -f /etc/corosync/qdevice/net/nssdb/pwdfile.txt

That section didn't exactly explain my issue but I wasn't aware that there is an NSS database for both the qnetd server as well as the qdevices, I thought there was only one for the qnetd. This nuance makes perfect sense now that I am aware of it but I didn't realize that until I read that manpage. I was assuming that the issue was really something else and that there weren't any NSS issues because node 2 and the qdevice could communicate just fine. The real issue was the NSS db on node 1.
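A minimal sketch for checking whether a given node has a qdevice NSS database, using the path from the manpage excerpt above. The helper name nssdb_state is made up. The file names are the ones NSS uses for its SQLite format (cert9.db, key4.db) and its older DBM format (cert8.db, key3.db). Whether corosync-qdevice needs anything beyond those files is not something this check covers.

```shell
#!/bin/sh
# Report what kind of NSS database, if any, a directory contains.
# Hypothetical helper for illustration; it only looks at file names.
nssdb_state() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing-dir"
    elif [ -f "$dir/cert9.db" ] && [ -f "$dir/key4.db" ]; then
        echo "sql"            # current SQLite-based format
    elif [ -f "$dir/cert8.db" ] && [ -f "$dir/key3.db" ]; then
        echo "legacy-dbm"     # older DBM format
    else
        echo "no-database"
    fi
}

# Default to the qdevice path; pass another directory as the first argument.
nssdb_state "${1:-/etc/corosync/qdevice/net/nssdb}"
```

Running this on each node and comparing the output is a quick way to spot a node whose database directory is absent or empty.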

This led me to investigate the path /etc/corosync/qdevice/net/nssdb on both of my Proxmox nodes, which made it clear why the solution in the other Proxmox forum post I found actually worked: my main node did not have a qdevice NSS db at all! For some reason node 2 had one but node 1 did not. I resolved the issue by taking the following steps:

  1. From node 2, scp the contents of /etc/corosync/qdevice/net/nssdb to the same path on node 1.

  2. From node 1, systemctl reset-failed corosync-qdevice.service and systemctl restart corosync-qdevice.service.

  3. pvecm status now shows the qdevice as having a vote from both nodes!

I still don't know how this actually broke. There must have been a working NSS db on node 1 at some point if it worked correctly in the past, and I am not sure what could have caused those files to disappear (or how the setup could have worked without those files if they were never there). At least I was able to resolve it for now, and I know where to look if this issue happens again.
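The steps above can be sketched as a small dry-run script. It only prints the commands by default so they can be reviewed first. Everything here is an assumption for illustration: the host name node1, root SSH access between the nodes, and the helper names. The service is started with systemctl restart, as in the earlier troubleshooting attempt.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps. With DRY_RUN=1 (the default) each
# command is printed with a leading "+" instead of being executed.
NSSDB=/etc/corosync/qdevice/net/nssdb

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1, on the healthy node: copy the nssdb directory into the same
# parent directory on the broken node ($1 is that node's host name).
copy_nssdb_to() {
    run scp -rp "$NSSDB" "root@$1:$(dirname "$NSSDB")/"
}

# Steps 2 and 3, on the broken node: clear the failed state, start the
# service again, then check that the Qdevice vote is back.
restart_qdevice() {
    run systemctl reset-failed corosync-qdevice.service
    run systemctl restart corosync-qdevice.service
    run pvecm status
}

copy_nssdb_to node1   # placeholder host name; run this part on node 2
restart_qdevice       # run this part on node 1
```

Setting DRY_RUN=0 would execute the commands instead of printing them. Since the two functions belong on different nodes, in practice each call would be run separately on the appropriate machine.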
 