[SOLVED] Corosync-Qdevice Problems

nicedevil

Hey guys, I have tried multiple times to add a QDevice to my cluster, without any luck.

My QDevice will be a Raspberry Pi running a Debian-based OS.

The output I always get is this:

Bash:
root@gateway:~# pvecm qdevice setup 10.4.0.7
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
                (if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Creating /etc/corosync/qnetd/nssdb
Creating new key and cert db
password file contains no data
Creating new noise file /etc/corosync/qnetd/nssdb/noise.txt
Creating new CA


Generating key.  This may take a few moments...

Is this a CA certificate [y/N]?
Enter the path length constraint, enter to skip [<0 for unlimited path]: > Is this a critical extension [y/N]?


Generating key.  This may take a few moments...

Notice: Trust flag u is set automatically if the private key is present.
QNetd CA certificate is exported as /etc/corosync/qnetd/nssdb/qnetd-cacert.crt

INFO: copying CA cert and initializing on all nodes
bash: line 1: corosync-qdevice-net-certutil: command not found
Certificate database already exists. Delete it to continue

INFO: generating cert request
command 'corosync-qdevice-net-certutil -r -n node1' failed: open3: exec of corosync-qdevice-net-certutil -r -n node1 failed: No such file or directory at /usr/share/perl5/PVE/Tools.pm line 455.
root@node1:~#

How can I fix this?
 
Hi,
did you already install the corosync-qdevice package on all cluster nodes? See here for the documentation.
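For reference, the usual setup is corosync-qdevice on every cluster node and corosync-qnetd on the external voter; roughly:

Code:
# on every Proxmox VE cluster node
apt update
apt install corosync-qdevice

# on the external QDevice host (here: the Raspberry Pi)
apt update
apt install corosync-qnetd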
 
Hi, yes I did; I have 2 nodes.
I installed corosync-qdevice on my Pi first, then on node1, and then on node2.

It may be worth mentioning that I had faulty nodes in the past which I had to remove from my cluster. I don't know whether any leftover configuration from them is still lying around somewhere.
 
As mentioned in the documentation, you need to install the corosync-qnetd package on the external server. What does
Code:
dpkg --list corosync-qdevice
which corosync-qdevice-net-certutil
on one of the cluster nodes show?
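The same kind of check on the Pi would be roughly:

Code:
dpkg --list corosync-qnetd
which corosync-qnetd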
 
Hi, I was just being careless when copy-pasting the commands; I had also run the command meant for the Raspberry Pi on both nodes....
OK, that is fixed now. The cluster nodes now have corosync-qdevice and the Pi has corosync-qnetd installed, but when running the setup command I get this:

Bash:
root@node2:~# pvecm qdevice setup 10.4.0.7 -f
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
                (if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db

INFO: copying CA cert and initializing on all nodes

node 'node1': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'node1': Creating new key and cert db
node 'node1': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'node1': Importing CA
node 'node2': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'node2': Creating new key and cert db
node 'node2': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'node2': Importing CA
INFO: generating cert request
Creating new certificate request


Generating key.  This may take a few moments...

Certificate request stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.crq

INFO: copying exported cert request to qnetd server

INFO: sign and export cluster cert
Signing cluster certificate
Certificate stored in /etc/corosync/qnetd/nssdb/cluster-skynet.crt

INFO: copy exported CRT

INFO: import certificate
Importing signed cluster certificate
Notice: Trust flag u is set automatically if the private key is present.
pk12util: PKCS12 EXPORT SUCCESSFUL
Certificate stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12

INFO: copy and import pk12 cert to all nodes

node 'node1': Importing cluster certificate and key
node 'node1': pk12util: PKCS12 IMPORT SUCCESSFUL
node 'node2': Importing cluster certificate and key
node 'node2': pk12util: PKCS12 IMPORT SUCCESSFUL
INFO: add QDevice to cluster configuration

INFO: start and enable corosync qdevice daemon on node 'node1'...
Job for corosync-qdevice.service failed because the control process exited with error code.
See "systemctl status corosync-qdevice.service" and "journalctl -xe" for details.
command 'ssh -o 'BatchMode=yes' -lroot 10.4.0.11 systemctl start corosync-qdevice' failed: exit code 1

journalctl -xe on 10.4.0.11 gives me this:

Bash:
Mar 17 19:32:24 node1 systemd[1]: corosync-qdevice.service: Scheduled restart job, restart counter is at 5.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ Automatic restarting of the unit corosync-qdevice.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
Mar 17 19:32:24 node1 systemd[1]: Stopped Corosync Qdevice daemon.
░░ Subject: A stop job for unit corosync-qdevice.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit corosync-qdevice.service has finished.
░░
░░ The job identifier is 5959 and the job result is done.
Mar 17 19:32:24 node1 systemd[1]: corosync-qdevice.service: Start request repeated too quickly.
Mar 17 19:32:24 node1 systemd[1]: corosync-qdevice.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit corosync-qdevice.service has entered the 'failed' state with result 'exit-code'.
Mar 17 19:32:24 gateway systemd[1]: Failed to start Corosync Qdevice daemon.
░░ Subject: A start job for unit corosync-qdevice.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit corosync-qdevice.service has finished with a failure.
░░
░░ The job identifier is 5959 and the job result is failed.

and here is the journalctl -b -u output:

Bash:
Mar 15 20:30:21 node1 systemd[1]: Started Corosync Cluster Engine.
Mar 15 20:30:26 node1 corosync[1695]:   [KNET  ] rx: host: 1 link: 0 is up
Mar 15 20:30:26 node1 corosync[1695]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] Sync members[2]: 1 2
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] Sync joined[1]: 1
Mar 15 20:30:26 node1 corosync[1695]:   [TOTEM ] A new membership (1.962) was formed. Members joined: 1
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] This node is within the primary component and will provide service.
Mar 15 20:30:26 node1 corosync[1695]:   [QUORUM] Members[2]: 1 2
Mar 15 20:30:26 node1 corosync[1695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 15 20:30:26 node1 corosync[1695]:   [KNET  ] pmtud: Global data MTU changed to: 469
Mar 15 20:30:40 node1 corosync[1695]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Mar 15 20:30:40 node1 corosync[1695]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Mar 15 22:00:22 node1 corosync[1695]:   [CFG   ] Node 1 was shut down by sysadmin
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] Sync members[1]: 2
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] Sync left[1]: 1
Mar 15 22:00:22 node1 corosync[1695]:   [TOTEM ] A new membership (2.966) was formed. Members left: 1
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 15 22:00:22 node1 corosync[1695]:   [QUORUM] Members[1]: 2
Mar 15 22:00:22 node1 corosync[1695]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 15 22:00:23 node1 corosync[1695]:   [KNET  ] link: host: 1 link: 0 is down
Mar 15 22:00:23 node1 corosync[1695]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Mar 15 22:00:23 node1 corosync[1695]:   [KNET  ] host: host: 1 has no active links
Mar 16 10:01:43 node1 corosync[1695]:   [KNET  ] rx: host: 1 link: 0 is up
 
[quoted: the journalctl -xe output from the post above]
This is just telling you that the service was restarted too many times in a row. The real error should be further up in the log. After fixing the cause of the error, you can run systemctl reset-failed corosync-qdevice.service and then try to start it again.
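Roughly, once the underlying problem is fixed:

Code:
# inspect the unit's own log for the real error
journalctl -b -u corosync-qdevice.service --no-pager

# clear the 'start request repeated too quickly' state and retry
systemctl reset-failed corosync-qdevice.service
systemctl start corosync-qdevice.service
systemctl status corosync-qdevice.service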

[quoted: the journalctl -b -u output from the post above]
Isn't this the output for corosync.service rather than corosync-qdevice.service?
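To see only the QDevice daemon's messages, something like this should do it:

Code:
journalctl -b -u corosync-qdevice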
 
Hi Fabian,

Thank you for all your tips so far. I found this after trying to reset and restart the service:

Bash:
A start job for unit corosync-qdevice.service has begun execution.
░░
░░ The job identifier is 7224.
Mar 18 20:06:20 node1 corosync-qdevice[2828938]: Can't read quorum.device.model cmap key.
Mar 18 20:06:20 node1 systemd[1]: corosync-qdevice.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit corosync-qdevice.service has exited.

Then I was able to google this error and found this thread here in the forum: https://forum.proxmox.com/threads/2-node-cluster-advice-on-adding-qdevice.99151/

That led me to https://www.danatec.org/2021/05/21/two-node-cluster-in-proxmox-ve-with-raspberry-pi-as-qdevice/
I had to remove the old QDevice first with `pvecm qdevice remove`, and then I was able to add it again :)
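For anyone else running into this, the sequence that worked for me was roughly (10.4.0.7 is my Pi):

Bash:
# remove the stale QDevice configuration from the cluster
pvecm qdevice remove

# set it up again against the external qnetd host
pvecm qdevice setup 10.4.0.7 -f

# check that the QDevice now shows up and provides a vote
pvecm status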
 
