Help restoring 4-node qdevice setup

May 23, 2012
Hi forum.

I'm setting up a 4-node cluster using a QDevice. Installation and setup went well; I set up the firewall and some storage ... only to realize I have to reinstall all 4 nodes again because my provider does not install ZFS by default.
So, with a rolling reinstall I'm switching the cluster over to ZFS quite easily... so far so good... then I remembered the QDevice setup. I had forgotten about it completely!

Now the situation is somewhat of a mess:

I tried to remove the QDevice... but it failed during the process.
I tried to re-add the QDevice just in case, but that fails too.

Now corosync-qdevice runs on two nodes, but fails to start on the other two...
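(For reference, that's just based on the service state on each node; standard systemd commands, nothing special:)

Code:
systemctl status corosync-qdevice
journalctl -u corosync-qdevice -b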

the command 'pvecm qdevice remove' yields:

error during cfs-locked 'file-corosync_conf' operation: No QDevice configured!

but pvecm status reads:

Code:
Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   NA,NV,NMW -Deleted for privacy-
0x00000002          1   NA,NV,NMW -Deleted for privacy-
0x00000003          1   NA,NV,NMW -Deleted for privacy-
0x00000004          1   NA,NV,NMW -Deleted for privacy- (local)
0x00000000          0            Qdevice (votes 0)

Trying to set it up again fails too:

Code:
pvecm qdevice setup -Deleted for privacy-
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
        (if you think this is a mistake, you may want to use -f option)
INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db
INFO: copying CA cert and initializing on all nodes
Certificate database already exists. Delete it to continue
Certificate database already exists. Delete it to continue
Host key verification failed.
Certificate database already exists. Delete it to continue
INFO: generating cert request
Certificate database doesn't exists. Use /sbin/corosync-qdevice-net-certutil -i to create it
command 'corosync-qdevice-net-certutil -r -n iccbroadcast' failed: exit code 1

How can I 'reset' the situation?
I was thinking of following the advice to delete /etc/corosync/qnetd/nssdb ... but I'm starting to fear I'll break the whole corosync cluster by doing that kind of thing, so better to ask before shooting again... is that the solution?
Ideally I would like to get rid of the QDevice completely and set it up again once all nodes are ready to go.
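
In case it matters, the kind of reset I have in mind would look roughly like this (only a sketch; the paths are taken from the messages above and <qnetd-address> is a placeholder):

Code:
# on the QNetd server: remove the existing certificate database (per the error above)
rm -r /etc/corosync/qnetd/nssdb
# on each cluster node: remove the node-side database, if present
rm -r /etc/corosync/qdevice/net/nssdb
# then try the setup again from one node
pvecm qdevice setup <qnetd-address>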


Thank you very much in advance.
Best regards.
 
Hi,

Normally, it should be enough to run the setup command with the "--force" flag:

Code:
 pvecm qdevice setup <address> --force

If this does not help, can you please send your corosync.conf?
And also tell us which files are present in the /etc/corosync/ directory.
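
A quick way to gather that (ordinary commands, assuming the default locations):

Code:
# cluster-wide corosync config managed by Proxmox
cat /etc/pve/corosync.conf
# node-local copy and any qdevice-related files
ls -la /etc/corosync/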
 

Hi, thank you for your help.
I got the pvecm syntax for options/modifiers wrong: although I tried 'force' as pointed out in the documentation, I missed the '--'.

Anyhow, I solved the problem, which was twofold:
- On the one hand, I had to delete /etc/corosync/qdevice/net/nssdb on all nodes.
- On the other hand, I hit the missing SSH host key in the known_hosts file... so: ssh -o HostKeyAlias=X.X.X.X root@X.X.X.X


Code:
pvecm qdevice setup A.B.C.D

added the device... and then

Code:
pvecm qdevice remove

deleted the device cleanly.
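
For anyone hitting the same thing, the recovery boiled down to roughly this (a sketch only; node names and the QNetd address are placeholders, and root SSH between the nodes is assumed):

Code:
# 1. on every node: remove the stale node-side certificate database
for node in node1 node2 node3 node4; do
    ssh root@"$node" "rm -r /etc/corosync/qdevice/net/nssdb"
done
# (the QNetd server's own /etc/corosync/qnetd/nssdb may also need deleting,
#  per the 'already exists' error earlier)
# 2. accept the QNetd host key once, using the HostKeyAlias trick from above
ssh -o HostKeyAlias=<qnetd-address> root@<qnetd-address> true
# 3. re-run the setup, then remove it cleanly until all nodes are ready
pvecm qdevice setup <qnetd-address>
pvecm qdevice remove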


Still, reinstalling the nodes keeps causing what I already consider 'a classic OVH-Proxmox problem' (I think I've been dealing with it since my early days with OVH and Proxmox 2.x): SSH host key verification.
The pvecm delnode command still does not properly clean up the SSH known_hosts file when a node is deleted.
Although I change the hostname on re-installation, the private/vRack IPs may need to stay the same (the public IPs certainly stay the same, since they're tied to the server)... and although the nodes join the cluster fine, migration, the shell, and that kind of thing do not work. You have to go through the whole mess of deleting the old keys by hand and SSHing around to record new ones, and even then, examining the file, I still see leftovers of the old node hostnames...
It's probably very much a side effect of the OVH working environment.
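
For the record, the manual cleanup looks roughly like this (a sketch; hostname/IP are placeholders, and I'm assuming the cluster-wide known_hosts is the one at /etc/pve/priv/known_hosts):

Code:
# drop stale entries for the reinstalled node (old hostname and its IP)
ssh-keygen -R <old-hostname> -f /etc/pve/priv/known_hosts
ssh-keygen -R <node-ip> -f /etc/pve/priv/known_hosts
# reconnect once so the new host key gets recorded again
ssh root@<node-ip> true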

Again, thank you very much.
Regards.
 
