Cluster help

squadfer
New Member · Nov 18, 2020
I'm in the process of creating a 2-node cluster with a qdevice running from TrueNAS. I already have VMs running on one of my nodes, node1, the one I created the cluster from via the GUI. The second node was a fresh install with some network settings already in place (LAN/management/corosync networks). I just joined node2 to my cluster via the GUI, and afterwards I can no longer access the web GUI from that node. I get an error indicating that "The site can't be reached: 10.90.100.41 took too long to respond."

I can ping that IP just fine. I can access the web GUI from node1, and I do see both nodes with green statuses. When I drill down to node2 within the GUI, I get the communication failure errors shown in the screenshots provided. I can access the shell for node2 via the GUI, and I can SSH into node2 via the management network just fine.
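For reference, the web GUI is served by pveproxy on port 8006, so a few checks over SSH on the affected node can help narrow down this kind of symptom (a sketch assuming a stock PVE install; $(hostname) expands to shinano here):
Code:
# is the proxy running and listening on 8006?
systemctl status pveproxy
ss -tlnp | grep 8006

# recent proxy errors; certificate problems show up here
journalctl -u pveproxy -b --no-pager | tail -n 20

# the per-node SSL cert that pveproxy serves
ls -l /etc/pve/nodes/$(hostname)/pve-ssl.pem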

Some information
Node1: yamato
Management: 10.90.100.42/24
LAN: 10.90.20.2/24
Corosync: 10.90.40.12/24
pvecm status
Code:
root@yamato:~# pvecm status
Quorum information
------------------
Date:             Tue Nov 17 19:42:54 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/36
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.90.40.12 (local)
0x00000002          1 10.90.40.13

Node2: shinano
Management: 10.90.100.41/24
LAN: 10.90.20.4/24
Corosync: 10.90.40.13/24
pvecm status
Code:
root@shinano:~# pvecm status
Quorum information
------------------
Date:             Tue Nov 17 19:42:29 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1/36
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.90.40.12
0x00000002          1 10.90.40.13 (local)

I did attempt to add the qdevice from node1. I'm including the error here to see if it sheds some additional light on the issue. It errored out when it attempted to add node2.

Code:
root@yamato:~# pvecm qdevice setup 10.90.40.11
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@10.90.40.11's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@10.90.40.11'"
and check to make sure that only the key(s) you wanted were added.


INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db

INFO: copying CA cert and initializing on all nodes
Host key verification failed.

node 'yamato': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'yamato': Creating new key and cert db
node 'yamato': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'yamato': Importing CA
INFO: generating cert request
Creating new certificate request


Generating key.  This may take a few moments...

Certificate request stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.crq

INFO: copying exported cert request to qnetd server

INFO: sign and export cluster cert
Signing cluster certificate
Certificate stored in /etc/corosync/qnetd/nssdb/cluster-HomeCluster.crt

INFO: copy exported CRT

INFO: import certificate
Importing signed cluster certificate
Notice: Trust flag u is set automatically if the private key is present.
pk12util: PKCS12 EXPORT SUCCESSFUL
Certificate stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12

INFO: copy and import pk12 cert to all nodes
Host key verification failed.
command 'ssh -o 'BatchMode=yes' -lroot 10.90.100.41 corosync-qdevice-net-certutil -m -c /etc/pve/qdevice-net-node.p12' failed: exit code 255
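As an aside, the "Certificate database ... already exists" notice above suggests a previous setup attempt. If a clean retry is ever needed, the message itself says what to do (only on the qnetd host, and only if you want setup to re-initialize from scratch):
Code:
# on the qnetd host (10.90.40.11): remove the old database so
# 'pvecm qdevice setup' can initialize a fresh one
rm -rf /etc/corosync/qnetd/nssdb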

My guess is this has to do with the SSH host key on node2 changing when it was added to the cluster.
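One way to test that theory is to try the same batch-mode connection the setup script uses (the IP is node2's management address from the failing command above):
Code:
# does the non-interactive SSH hop that the setup uses work?
ssh -o BatchMode=yes root@10.90.100.41 true

# if it fails on host key verification, drop the stale entry;
# note PVE also keeps a cluster-wide file at /etc/pve/priv/known_hosts
ssh-keygen -f /root/.ssh/known_hosts -R 10.90.100.41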

Thanks
 

Attachments

  • ClusterGui.PNG
  • ClusterGuiNetworktab.PNG
I did some further investigation and reading. My hunch that this was SSH-key related has been confirmed. I stumbled upon the following command:
Code:
pvecm updatecerts
I ran this command on node2 and it resolved my issues with the web GUI. It also resolved the error with adding the qdevice.
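From what I've read, pvecm updatecerts regenerates and redistributes the node certificates and the cluster-wide known_hosts under /etc/pve, which would explain why it fixed both symptoms. If the web GUI were still unreachable afterwards, restarting the proxy would be a reasonable follow-up (a sketch):
Code:
# force-regenerate the node certificates, then restart the GUI proxy
pvecm updatecerts --force
systemctl restart pveproxy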

pvecm status now returns:
Code:
root@yamato:/# pvecm status
Quorum information
------------------
Date:             Tue Nov 17 21:22:54 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1/36
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 10.90.40.12 (local)
0x00000002          1    A,V,NMW 10.90.40.13
0x00000000          1            Qdevice
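For anyone wanting to double-check the qdevice from the other side, corosync ships status tools for both ends (run the first two on the qnetd host, here the TrueNAS-hosted machine at 10.90.40.11):
Code:
# on the qnetd host: overall status and connected clusters
corosync-qnetd-tool -s
corosync-qnetd-tool -l

# on a cluster node: the qdevice client's view
corosync-qdevice-tool -s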
 
