Help restoring 4-node qdevice setup

May 23, 2012
Hi forum.

I'm setting up a 4-node cluster using a QDevice. Installation and setup went well; I set up the firewall and some storage ... only to realize I have to reinstall all 4 nodes again because my provider does not install ZFS by default.
So, with a rolling reinstall I'm switching the cluster over to ZFS quite easily... so far so good... then I remembered the QDevice setup. I had forgotten about it completely!

Now the situation is somewhat of a mess:

I tried to remove the QDevice... but it failed during the process.
I tried to re-add the QDevice just in case, but that fails too.

Now corosync-qdevice runs on two nodes, but fails to start on the other two...
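(For reference, that's just based on the service state on each node; standard systemd commands, nothing special:)

Code:
systemctl status corosync-qdevice
journalctl -u corosync-qdevice -b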

the command 'pvecm qdevice remove' yields:

error during cfs-locked 'file-corosync_conf' operation: No QDevice configured!

but pvecm status reads:

Code:
Votequorum information
----------------------
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   NA,NV,NMW -Deleted for privacy-
0x00000002          1   NA,NV,NMW -Deleted for privacy-
0x00000003          1   NA,NV,NMW -Deleted for privacy-
0x00000004          1   NA,NV,NMW -Deleted for privacy- (local)
0x00000000          0            Qdevice (votes 0)

Trying to set it up again fails too:

Code:
pvecm qdevice setup -Deleted for privacy-
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
        (if you think this is a mistake, you may want to use -f option)
INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db
INFO: copying CA cert and initializing on all nodes
Certificate database already exists. Delete it to continue
Certificate database already exists. Delete it to continue
Host key verification failed.
Certificate database already exists. Delete it to continue
INFO: generating cert request
Certificate database doesn't exists. Use /sbin/corosync-qdevice-net-certutil -i to create it
command 'corosync-qdevice-net-certutil -r -n iccbroadcast' failed: exit code 1

How can I 'reset' the situation?
I was thinking of following the advice to delete /etc/corosync/qnetd/nssdb ... but I'm starting to fear I'll break the whole corosync cluster by doing that kind of thing, so better to ask before shooting again... is that the solution?
Ideally I would like to get rid of the QDevice completely and set it up again once all nodes are ready to go.
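
In case it matters, the kind of reset I have in mind would look roughly like this (only a sketch; the paths are taken from the messages above and <qnetd-address> is a placeholder):

Code:
# on the QNetd server: remove the existing certificate database (per the error above)
rm -r /etc/corosync/qnetd/nssdb
# on each cluster node: remove the node-side database, if present
rm -r /etc/corosync/qdevice/net/nssdb
# then try the setup again from one node
pvecm qdevice setup <qnetd-address>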


Thank you very much in advance.
Best regards.
 
Hi,

Normally, it should be enough to run the setup command with the "--force" flag:

Code:
 pvecm qdevice setup <address> --force

If this does not help, can you please send your corosync.conf?
And also tell us which files are present in the /etc/corosync/ directory.
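
A quick way to gather that (ordinary commands, assuming the default locations):

Code:
# cluster-wide corosync config managed by Proxmox
cat /etc/pve/corosync.conf
# node-local copy and any qdevice-related files
ls -la /etc/corosync/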
 

Hi, thank you for your help.
I got the pvecm syntax for options/modifiers wrong: although I tried 'force' as pointed out in the documentation, I missed the '--'.

Anyhow, I solved the problem, which was twofold:
- On the one hand, I had to delete /etc/corosync/qdevice/net/nssdb on all nodes.
- On the other hand, I hit the missing SSH host key in the known_hosts file... so: ssh -o HostKeyAlias=X.X.X.X root@X.X.X.X


Code:
pvecm qdevice setup A.B.C.D

added the device... and then

Code:
pvecm qdevice remove

deleted the device cleanly.
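
For anyone hitting the same thing, the recovery boiled down to roughly this (a sketch only; node names and the QNetd address are placeholders, and root SSH between the nodes is assumed):

Code:
# 1. on every node: remove the stale node-side certificate database
for node in node1 node2 node3 node4; do
    ssh root@"$node" "rm -r /etc/corosync/qdevice/net/nssdb"
done
# (the QNetd server's own /etc/corosync/qnetd/nssdb may also need deleting,
#  per the 'already exists' error earlier)
# 2. accept the QNetd host key once, using the HostKeyAlias trick from above
ssh -o HostKeyAlias=<qnetd-address> root@<qnetd-address> true
# 3. re-run the setup, then remove it cleanly until all nodes are ready
pvecm qdevice setup <qnetd-address>
pvecm qdevice remove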


Still, reinstalling the nodes keeps causing what I already consider 'a classic OVH-Proxmox problem' (I think I've been dealing with it since my early days with OVH and Proxmox 2.x): SSH host key verification.
The pvecm delnode command still does not properly clean up the SSH known_hosts file when a node is deleted.
Although I change the hostname on re-installation, the private/vRack IPs may need to stay the same (the public IPs certainly stay the same, since they're tied to the server)... and although the nodes join the cluster fine, migration, the shell, and that kind of thing do not work. You have to go through the whole mess of deleting the old keys by hand and SSHing around to record new ones, and even then, examining the file, I still see leftovers of the old node hostnames...
It's probably very much a side effect of the OVH working environment.
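
For the record, the manual cleanup looks roughly like this (a sketch; hostname/IP are placeholders, and I'm assuming the cluster-wide known_hosts is the one at /etc/pve/priv/known_hosts):

Code:
# drop stale entries for the reinstalled node (old hostname and its IP)
ssh-keygen -R <old-hostname> -f /etc/pve/priv/known_hosts
ssh-keygen -R <node-ip> -f /etc/pve/priv/known_hosts
# reconnect once so the new host key gets recorded again
ssh root@<node-ip> true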

Again, thank you very much.
Regards.
 
