[SOLVED] Corosync-Qdevice Problems

nicedevil

Hey guys, I have tried multiple times to add a QDevice to my cluster, without any luck.

My QDevice will be a Raspberry Pi running a Debian-based OS.

The output I always get is this:

Bash:
root@gateway:~# pvecm qdevice setup 10.4.0.7
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
                (if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Creating /etc/corosync/qnetd/nssdb
Creating new key and cert db
password file contains no data
Creating new noise file /etc/corosync/qnetd/nssdb/noise.txt
Creating new CA


Generating key.  This may take a few moments...

Is this a CA certificate [y/N]?
Enter the path length constraint, enter to skip [<0 for unlimited path]: > Is this a critical extension [y/N]?


Generating key.  This may take a few moments...

Notice: Trust flag u is set automatically if the private key is present.
QNetd CA certificate is exported as /etc/corosync/qnetd/nssdb/qnetd-cacert.crt

INFO: copying CA cert and initializing on all nodes
bash: line 1: corosync-qdevice-net-certutil: command not found
Certificate database already exists. Delete it to continue

INFO: generating cert request
command 'corosync-qdevice-net-certutil -r -n node1' failed: open3: exec of corosync-qdevice-net-certutil -r -n node1 failed: No such file or directory at /usr/share/perl5/PVE/Tools.pm line 455.
root@node1:~#

How can I fix this?
 
Hi,
did you already install the corosync-qdevice package on all cluster nodes? See here for the documentation.
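For reference, the usual setup is corosync-qdevice on every cluster node and corosync-qnetd on the external voter; roughly:

Code:
# on every Proxmox VE cluster node
apt update
apt install corosync-qdevice

# on the external QDevice host (here: the Raspberry Pi)
apt update
apt install corosync-qnetd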
 
Hi, yes I did; I have 2 nodes.
I installed corosync-qdevice on my Pi first, then on node1, and then on node2.

It may be worth mentioning that I had faulty nodes in the past which I had to remove from my cluster. I don't know whether any leftover configuration from them is still lying around somewhere.
 
As mentioned in the documentation, you need to install the corosync-qnetd package on the external server. What does
Code:
dpkg --list corosync-qdevice
which corosync-qdevice-net-certutil
on one of the cluster nodes show?
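The same kind of check on the Pi would be roughly:

Code:
dpkg --list corosync-qnetd
which corosync-qnetd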
 
Hi, I was just being careless when copy-pasting the commands; I had also run the command meant for the Raspberry Pi on both nodes....
OK, that is fixed now. The cluster nodes now have corosync-qdevice and the Pi has corosync-qnetd installed, but when running the setup command I get this:

Bash:
root@node2:~# pvecm qdevice setup 10.4.0.7 -f
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
                (if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db

INFO: copying CA cert and initializing on all nodes

node 'node1': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'node1': Creating new key and cert db
node 'node1': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'node1': Importing CA
node 'node2': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'node2': Creating new key and cert db
node 'node2': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'node2': Importing CA
INFO: generating cert request
Creating new certificate request


Generating key.  This may take a few moments...

Certificate request stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.crq

INFO: copying exported cert request to qnetd server

INFO: sign and export cluster cert
Signing cluster certificate
Certificate stored in /etc/corosync/qnetd/nssdb/cluster-skynet.crt

INFO: copy exported CRT

INFO: import certificate
Importing signed cluster certificate
Notice: Trust flag u is set automatically if the private key is present.
pk12util: PKCS12 EXPORT SUCCESSFUL
Certificate stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12

INFO: copy and import pk12 cert to all nodes

node 'node1': Importing cluster certificate and key
node 'node1': pk12util: PKCS12 IMPORT SUCCESSFUL
node 'node2': Importing cluster certificate and key
node 'node2': pk12util: PKCS12 IMPORT SUCCESSFUL
INFO: add QDevice to cluster configuration

INFO: start and enable corosync qdevice daemon on node 'node1'...
Job for corosync-qdevice.service failed because the control process exited with error code.
See "systemctl status corosync-qdevice.service" and "journalctl -xe" for details.
command 'ssh -o 'BatchMode=yes' -lroot 10.4.0.11 systemctl start corosync-qdevice' failed: exit code 1

journalctl -xe on 10.4.0.11 gives me this:

Bash:
Mar 17 19:32:24 node1 systemd[1]: corosync-qdevice.service: Scheduled restart job, restart counter is at 5.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ Automatic restarting of the unit corosync-qdevice.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
Mar 17 19:32:24 node1 systemd[1]: Stopped Corosync Qdevice daemon.
░░ Subject: A stop job for unit corosync-qdevice.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit corosync-qdevice.service has finished.
░░
░░ The job identifier is 5959 and the job result is done.
Mar 17 19:32:24 node1 systemd[1]: corosync-qdevice.service: Start request repeated too quickly.
Mar 17 19:32:24 node1 systemd[1]: corosync-qdevice.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit corosync-qdevice.service has entered the 'failed' state with result 'exit-code'.
Mar 17 19:32:24 gateway systemd[1]: Failed to start Corosync Qdevice daemon.
░░ Subject: A start job for unit corosync-qdevice.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit corosync-qdevice.service has finished with a failure.
░░
░░ The job identifier is 5959 and the job result is failed.

and here is the journalctl -b -u output:

Bash:
Mar 15 20:30:21 node1 systemd[1]: Started Corosync Cluster Engine.
Mar 15 20:30:26 node1 corosync[1695]:   [KNET  ] rx: host: 1 link: 0 is up
Mar 15 20:30:26 node1 corosync[1695]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] Sync members[2]: 1 2
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] Sync joined[1]: 1
Mar 15 20:30:26 node1 corosync[1695]:   [TOTEM ] A new membership (1.962) was formed. Members joined: 1
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] This node is within the primary component and will provide service.
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] Members[2]: 1 2
Mar 15 20:30:26 node1 corosync[1695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 15 20:30:26 node1 corosync[1695]:   [KNET  ] pmtud: Global data MTU changed to: 469
Mar 15 20:30:40 node1 corosync[1695]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Mar 15 20:30:40 node1 corosync[1695]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 15 22:00:22 node1 corosync[1695]:   [CFG   ] Node 1 was shut down by sysadmin
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] Sync members[1]: 2
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] Sync left[1]: 1
Mar 15 22:00:22 node1 corosync[1695]:   [TOTEM ] A new membership (2.966) was formed. Members left: 1
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] Members[1]: 2
Mar 15 22:00:22 node1 corosync[1695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 15 22:00:23 node1 corosync[1695]:   [KNET  ] link: host: 1 link: 0 is down
Mar 15 22:00:23 node1 corosync[1695]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 15 22:00:23 node1 corosync[1695]:   [KNET  ] host: host: 1 has no active links
Mar 16 10:01:43 node1 corosync[1695]:   [KNET  ] rx: host: 1 link: 0 is up
 
[quoted: the journalctl -xe output from the post above]
This is just telling you that the service was restarted too many times in a row. The real error should be further up in the log. After fixing the cause of the error, you can run systemctl reset-failed corosync-qdevice.service and then try to start it again.
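Roughly, once the underlying problem is fixed:

Code:
# inspect the unit's own log for the real error
journalctl -b -u corosync-qdevice.service --no-pager

# clear the 'start request repeated too quickly' state and retry
systemctl reset-failed corosync-qdevice.service
systemctl start corosync-qdevice.service
systemctl status corosync-qdevice.service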

[quoted: the journalctl -b -u output from the post above]
Isn't this the output for corosync.service rather than corosync-qdevice.service?
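To see only the QDevice daemon's messages, something like this should do it:

Code:
journalctl -b -u corosync-qdevice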
 
Hi Fabian,

Thank you for all your tips so far. I found this after trying to reset and restart the service:

Bash:
A start job for unit corosync-qdevice.service has begun execution.
░░
░░ The job identifier is 7224.
Mar 18 20:06:20 node1 corosync-qdevice[2828938]: Can't read quorum.device.model cmap key.
Mar 18 20:06:20 node1 systemd[1]: corosync-qdevice.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit corosync-qdevice.service has exited.

Then I was able to google this error and found this thread here in the forum: https://forum.proxmox.com/threads/2-node-cluster-advice-on-adding-qdevice.99151/

That led me to https://www.danatec.org/2021/05/21/two-node-cluster-in-proxmox-ve-with-raspberry-pi-as-qdevice/
I had to remove the old QDevice first with `pvecm qdevice remove`, and then I was able to add it again :)
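For anyone else running into this, the sequence that worked for me was roughly (10.4.0.7 is my Pi):

Bash:
# remove the stale QDevice configuration from the cluster
pvecm qdevice remove

# set it up again against the external qnetd host
pvecm qdevice setup 10.4.0.7 -f

# check that the QDevice now shows up and provides a vote
pvecm status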
 
