I have a problem getting a QDevice to work on proxmox 6.2-12
First I install the QDevice package on the 3rd witness (Raspberry Pi OS 20-08-2020) box:
I install the packages on the cluster machines as per the instructions and get this nice looking output on the cluster node:
Here we run into the problem:
And the error in the if I run the command manually in debug mode is 'error 18 disconnected from server.'
Seems the service wont start for some reason, is this a bug? I cant find any communication issues between the nodes and Q but I am open to suggestions where to look next, everything seems ok apart from the service wont start on any of the nodes.
First I install the QDevice package on the 3rd witness (Raspberry Pi OS 20-08-2020) box:
Code:
# apt install corosync-qnetd
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
corosync-qnetd
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 49.3 kB of archives.
After this operation, 153 kB of additional disk space will be used.
Get:1 http://mirror.init7.net/raspbian/raspbian buster/main armhf corosync-qnetd armhf 3.0.0-4+deb10u1 [49.3 kB]
Fetched 49.3 kB in 1s (85.5 kB/s)
Selecting previously unselected package corosync-qnetd.
(Reading database ... 40387 files and directories currently installed.)
Preparing to unpack .../corosync-qnetd_3.0.0-4+deb10u1_armhf.deb ...
Unpacking corosync-qnetd (3.0.0-4+deb10u1) ...
Setting up corosync-qnetd (3.0.0-4+deb10u1) ...
Creating /etc/corosync/qnetd/nssdb
Creating new key and cert db
password file contains no data
Creating new noise file /etc/corosync/qnetd/nssdb/noise.txt
Creating new CA
Generating key. This may take a few moments...
Is this a CA certificate [y/N]?
Enter the path length constraint, enter to skip [<0 for unlimited path]: > Is this a critical extension [y/N]?
Generating key. This may take a few moments...
Notice: Trust flag u is set automatically if the private key is present.
QNetd CA certificate is exported as /etc/corosync/qnetd/nssdb/qnetd-cacert.crt
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qnetd.service ? /lib/systemd/system/corosync-qnetd.service.
Processing triggers for man-db (2.8.5-2) ...
Processing triggers for systemd (241-7~deb10u4+rpi1) ...
I install the packages on the cluster machines as per the instructions and get this nice looking output on the cluster node:
Code:
# pvecm qdevice setup 10.192.3.243
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
(if you think this is a mistake, you may want to use -f option)
INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db
INFO: copying CA cert and initializing on all nodes
node 'srv-pve-01': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'srv-pve-01': Creating new key and cert db
node 'srv-pve-01': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'srv-pve-01': Importing CA
node 'srv-pve-02': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'srv-pve-02': Creating new key and cert db
node 'srv-pve-02': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'srv-pve-02': Importing CA
INFO: generating cert request
Creating new certificate request
Generating key. This may take a few moments...
Certificate request stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.crq
INFO: copying exported cert request to qnetd server
INFO: sign and export cluster cert
Signing cluster certificate
Certificate stored in /etc/corosync/qnetd/nssdb/cluster-modal.crt
INFO: copy exported CRT
INFO: import certificate
Importing signed cluster certificate
Notice: Trust flag u is set automatically if the private key is present.
pk12util: PKCS12 EXPORT SUCCESSFUL
Certificate stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12
INFO: copy and import pk12 cert to all nodes
node 'srv-pve-01': Importing cluster certificate and key
node 'srv-pve-01': pk12util: PKCS12 IMPORT SUCCESSFUL
node 'srv-pve-02': Importing cluster certificate and key
node 'srv-pve-02': pk12util: PKCS12 IMPORT SUCCESSFUL
INFO: add QDevice to cluster configuration
INFO: start and enable corosync qdevice daemon on node 'srv-pve-01'...
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qdevice.service -> /lib/systemd/system/corosync-qdevice.service.
INFO: start and enable corosync qdevice daemon on node 'srv-pve-02'...
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qdevice.service -> /lib/systemd/system/corosync-qdevice.service.
Reloading corosync.conf...
Done
Here we run into the problem:
Code:
service corosync-qdevice status
● corosync-qdevice.service - Corosync Qdevice daemon
Loaded: loaded (/lib/systemd/system/corosync-qdevice.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2020-11-11 20:32:44 CET; 29s ago
Docs: man:corosync-qdevice
Process: 25363 ExecStart=/usr/sbin/corosync-qdevice -f $COROSYNC_QDEVICE_OPTIONS (code=exited, status=2)
Main PID: 25363 (code=exited, status=2)
Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Scheduled restart job, restart counter is at 6.
Nov 11 20:32:44 srv-pve-02 systemd[1]: Stopped Corosync Qdevice daemon.
Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Start request repeated too quickly.
Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Failed with result 'exit-code'.
Nov 11 20:32:44 srv-pve-02 systemd[1]: Failed to start Corosync Qdevice daemon.
And the error in the if I run the command manually in debug mode is 'error 18 disconnected from server.'
Code:
# /usr/sbin/corosync-qdevice -f -d
Nov 11 20:39:45 debug Initializing votequorum
Nov 11 20:39:45 debug Initializing local socket
Nov 11 20:39:45 debug Registering qdevice models
Nov 11 20:39:45 debug Configuring qdevice
Nov 11 20:39:45 debug Configuring master_wins
Nov 11 20:39:45 debug Getting configuration node list
Nov 11 20:39:45 debug Initializing qdevice model
Nov 11 20:39:45 debug Initializing qdevice_net_instance
Nov 11 20:39:45 debug Registering algorithms
Nov 11 20:39:45 debug Initializing NSS
Nov 11 20:39:45 debug Cast vote timer remains stopped.
Nov 11 20:39:45 debug Initializing cmap tracking
Nov 11 20:39:45 debug Waiting for ring id
Nov 11 20:39:45 debug Votequorum nodelist notify callback:
Nov 11 20:39:45 debug Ring_id = (1.12)
Nov 11 20:39:45 debug Node list (size = 2):
Nov 11 20:39:45 debug 0 nodeid = 1
Nov 11 20:39:45 debug 1 nodeid = 2
Nov 11 20:39:45 debug Algorithm decided to pause cast vote timer and result vote is No change
Nov 11 20:39:45 debug Cast vote timer is now paused.
Nov 11 20:39:45 debug Votequorum quorum notify callback:
Nov 11 20:39:45 debug Quorate = 1
Nov 11 20:39:45 debug Node list (size = 3):
Nov 11 20:39:45 debug 0 nodeid = 1, state = 1
Nov 11 20:39:45 debug 1 nodeid = 2, state = 1
Nov 11 20:39:45 debug 2 nodeid = 0, state = 0
Nov 11 20:39:45 debug Algorithm decided to not send list and result vote is No change
Nov 11 20:39:45 debug Waiting for initial heuristics exec result
Nov 11 20:39:45 debug worker: Heuristic worker initialized
Nov 11 20:39:45 debug worker: qdevice_heuristics_worker_cmd_process_one_line: Received exec-list-clear command
Nov 11 20:39:45 debug worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "1" and timeout "15000"
Nov 11 20:39:45 debug Received heuristics exec result command with seq_no "1" and result "Disabled"
Nov 11 20:39:45 debug Votequorum heuristics exec result callback:
Nov 11 20:39:45 debug seq_number = 1, exec_result = Disabled
Nov 11 20:39:45 debug Algorithm decided to not send list, result vote is No change and heuristics is Undefined
Nov 11 20:39:45 debug Cast vote timer is no longer paused.
Nov 11 20:39:45 debug Not scheduling heuristics timer because mode is not enabled
Nov 11 20:39:45 debug Running qdevice model
Nov 11 20:39:45 debug Executing qdevice-net
Nov 11 20:39:45 debug Trying connect to qnetd server 10.192.3.243:5403 (timeout = 8000ms)
Nov 11 20:39:45 debug Sending preinit msg to qnetd
Nov 11 20:39:45 debug Received preinit reply msg
Nov 11 20:39:45 debug Sending client auth data.
Nov 11 20:39:45 debug Received init reply msg
Nov 11 20:39:45 debug Scheduling send of heartbeat every 8000ms
Nov 11 20:39:45 debug Executing after-connect heuristics.
Nov 11 20:39:45 debug worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "2" and timeout "5000"
Nov 11 20:39:45 debug Received heuristics exec result command with seq_no "2" and result "Disabled"
Nov 11 20:39:45 debug Algorithm decided to send config node list, send membership node list, send quorum node list, heuristics is Undefined and result vote is Wait for reply
Nov 11 20:39:45 debug Sending config node list seq = 4
Nov 11 20:39:45 debug Node list:
Nov 11 20:39:45 debug Sending membership node list seq = 5, ringid = (1.12), heuristics = Undefined.
Nov 11 20:39:45 debug Node list:
Nov 11 20:39:45 debug 0 node_id = 1, data_center_id = 0, node_state = not set
Nov 11 20:39:45 debug 1 node_id = 2, data_center_id = 0, node_state = not set
Nov 11 20:39:45 debug Sending quorum node list seq = 6, quorate = 1
Nov 11 20:39:45 debug Node list:
Nov 11 20:39:45 debug 0 node_id = 1, data_center_id = 0, node_state = member
Nov 11 20:39:45 debug 1 node_id = 2, data_center_id = 0, node_state = member
Nov 11 20:39:45 debug Cast vote timer remains stopped.
Nov 11 20:39:45 error Received server error 18. Disconnecting from server
Nov 11 20:39:45 debug Algorithm result vote is NACK
Nov 11 20:39:45 debug Cast vote timer is now scheduled every 5000ms voting NACK.
Nov 11 20:39:45 debug Removing cmap tracking
Nov 11 20:39:45 debug Destroying qdevice model
Nov 11 20:39:45 debug Destroying algorithm
Nov 11 20:39:45 debug Destroying qdevice_net_instance
Nov 11 20:39:45 debug Shutting down NSS
Nov 11 20:39:45 debug Destroying qdevice ipc
Nov 11 20:39:45 debug Destroying votequorum and cmap
Nov 11 20:39:45 debug Destroying heuristics
Nov 11 20:39:45 debug Waiting for heuristics worker to finish
Nov 11 20:39:45 debug worker: Heuristic worker shutdown requested
Nov 11 20:39:45 debug worker: Waiting for all processes to exit
Nov 11 20:39:45 debug Closing log
Seems the service wont start for some reason, is this a bug? I cant find any communication issues between the nodes and Q but I am open to suggestions where to look next, everything seems ok apart from the service wont start on any of the nodes.