[SOLVED] Broken qdevice setup.

Akegata

I'm setting up a Proxmox cluster with two nodes and a Raspberry Pi as a qdevice.
I first set it up at home without any issues with the qdevice setup. I then ran pvecm qdevice remove, since I was moving the setup to another network and a different Pi.

After installing the packages and enabling root SSH access on the new Pi, running pvecm qdevice setup <IP> fails, even with --force, with the following message:

Code:
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
                (if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db

INFO: copying CA cert and initializing on all nodes
Host key verification failed.
Certificate database already exists. Delete it to continue

INFO: generating cert request
Certificate database doesn't exists. Use /sbin/corosync-qdevice-net-certutil -i to create it
command 'corosync-qdevice-net-certutil -r -n SF' failed: exit code 1

I have tried purging the corosync-qnetd and corosync-qdevice packages from the Pi and reinstalling them. I have tried removing /etc/corosync/qnetd/nssdb, and I've even removed authorized_keys on the Pi, but nothing changes.
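
This is roughly what I ran on the Pi during those attempts (just the obvious cleanup, nothing exotic):

Code:
# purge and reinstall the qnetd/qdevice packages
apt purge corosync-qnetd corosync-qdevice
apt install corosync-qnetd corosync-qdevice
# remove the leftover certificate database mentioned in the error
rm -rf /etc/corosync/qnetd/nssdb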

This is what corosync.conf looks like:

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.55.206
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.55.207
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: SF
  config_version: 5
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Hi,

I've tried to follow the steps you did, and I was unable to reproduce the problem here with a fresh setup (following https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_qdevice_net_setup). Add, remove, and add qdevice again all worked fine.

Have you installed corosync-qnetd on the Pi?

What is the output of pvecm status currently?

Please also include the output of pveversion -v.
 
Yeah, corosync-qnetd is installed on the Pi.

pvecm status:
Code:
Cluster information
-------------------
Name:             SF
Config Version:   5
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Sep 28 21:10:33 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.c2
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.55.206 (local)
0x00000002          1 192.168.55.207

pveversion -v:

Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1


Is there some way to completely purge all configuration related to the qdevice setup so I can do a fresh setup?
 
I can second the problem described here. I also tried setting up a qdevice in a fresh cluster and got these errors.
The qdevice service fails to start on all three machines.
Additionally, the qdevice appears in pvecm, but when I try to delete it, it says "no qdevice configured".

Edit: After several attempts I have the impression that SSH logins might be my problem. I installed the first PVE node with an IP address that is now in a different VLAN, and the cluster manager used that address when creating the cluster. That IP address is no longer allowed to log in as root, and during the qdevice setup this initial address is used even if I specify another subnet.
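
In case it helps anyone else, the checks boil down to something like this (the placeholder is just an example; use whatever ring0_addr your cluster was created with):

Code:
# see which addresses the cluster was created with
grep ring0_addr /etc/pve/corosync.conf
# from each node, root SSH to those addresses has to work
ssh root@<ring0_addr-of-the-other-node> true
# if root logins are rejected, check sshd on the target machine
grep -i PermitRootLogin /etc/ssh/sshd_config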
 
After several attempts I have the impression that SSH logins might be my problem. I installed the first PVE node with an IP address that is now in a different VLAN, and the cluster manager used that address when creating the cluster. That IP address is no longer allowed to log in as root, and during the qdevice setup this initial address is used even if I specify another subnet.

That seems like it could be the issue (changing the IP also seems to be a common pattern in failing cluster setups).

Is there some way to completely purge all configuration related to the qdevice setup so I can do a fresh setup?

Yes, just remove /etc/corosync/qnetd/nssdb, as the error message is telling you to.
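
On the Pi that means something along these lines, followed by re-running the setup from a cluster node:

Code:
# on the Pi (the qnetd host)
rm -rf /etc/corosync/qnetd/nssdb

# then, from one of the cluster nodes
pvecm qdevice setup <IP-of-the-Pi> --force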
 
Yes, just remove /etc/corosync/qnetd/nssdb, as the error message is telling you to.

I don't think that completely clears everything out. I had already tried that before, but I tried again now with the same result. This is the error I get after removing nssdb:

Code:
pvecm qdevice setup 192.168.55.208 --force
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
                (if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Creating /etc/corosync/qnetd/nssdb
Creating new key and cert db
password file contains no data
Creating new noise file /etc/corosync/qnetd/nssdb/noise.txt
Creating new CA


Generating key.  This may take a few moments...

Is this a CA certificate [y/N]?
Enter the path length constraint, enter to skip [<0 for unlimited path]: > Is this a critical extension [y/N]?


Generating key.  This may take a few moments...

Notice: Trust flag u is set automatically if the private key is present.
QNetd CA certificate is exported as /etc/corosync/qnetd/nssdb/qnetd-cacert.crt

INFO: copying CA cert and initializing on all nodes
Host key verification failed.
Certificate database already exists. Delete it to continue

INFO: generating cert request
Certificate database doesn't exists. Use /sbin/corosync-qdevice-net-certutil -i to create it
command 'corosync-qdevice-net-certutil -r -n SF' failed: exit code 1

The "Certificate database already exists." followed by "Certificate database doesn't exists." is very confusing.
I have tried running corosync-qdevice-net-certutil -i as suggested, that gives me the error "Can't open certificate file".
 
The "Certificate database already exists." followed by "Certificate database doesn't exists." is very confusing.
I have tried running corosync-qdevice-net-certutil -i as suggested, that gives me the error "Can't open certificate file".
really weird.

this sticks out to me: Host key verification failed.

can you check if you can ssh to the pi as root?

which raspberry pi are you using? which OS version is it running? can you post the cat /etc/os-release output? it should be running at least debian buster or equivalent
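
To narrow down the Host key verification failed part, you can test the SSH hops directly, something like:

Code:
# run on each cluster node; every hop must work without any prompt
ssh root@<IP-of-the-Pi> hostname
ssh root@<IP-of-the-other-node> hostname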
 
I can SSH to the Pi as root from the first cluster node, but not from the second.
Should I be able to do it from both? I would assume it's only necessary from the node where I run pvecm qdevice setup?

It's a Raspberry Pi 3:

Code:
cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"
 
I can SSH to the Pi as root from the first cluster node, but not from the second.
Should I be able to do it from both? I would assume it's only necessary from the node where I run pvecm qdevice setup?
All of the nodes need to be able to connect to each other; please try it that way.
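
If a hop fails with Host key verification failed, logging in once interactively (to accept the host key) or re-copying the key usually sorts it out, e.g.:

Code:
# from the node where the hop fails: accept the host key, then install the key
ssh root@<IP-of-the-Pi>
ssh-copy-id root@<IP-of-the-Pi>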
 
OK, I have no idea what happened, but I managed to break something else while trying different things. I removed the nssdb databases again and /etc/pve/qdevice-net-node.p12, and probably did something else as well.
Now the qdevice setup worked, and the qdevice shows up in pvecm status and everything. I really don't know what actually solved it, sadly. :/
I can at least say that only the node where pvecm qdevice setup is run needs to be able to log in to the other nodes as root.
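
For anyone trying the same, what I removed was roughly this (the paths are the defaults; the node-side nssdb path is my best guess at what I deleted, and I may have removed more than is listed here):

Code:
# on the Pi (qnetd host)
rm -rf /etc/corosync/qnetd/nssdb
# on the cluster nodes (node-side qdevice certificate store)
rm -rf /etc/corosync/qdevice/net/nssdb
rm -f /etc/pve/qdevice-net-node.p12
# then, from one node
pvecm qdevice setup <IP-of-the-Pi> --force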
 
Maybe I spoke too soon.
The qdevice is added, but it seems like it doesn't have a vote.


Code:
Cluster information
-------------------
Name:             SF
Config Version:   8
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Sep 29 13:40:24 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.109
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   A,NV,NMW 192.168.55.206 (local)
0x00000002          1   A,NV,NMW 192.168.55.207
0x00000000          0            Qdevice (votes 1)

The corosync config says it should have one vote though:

Code:
quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 192.168.55.208
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}
 
Maybe I spoke too soon.
The qdevice is added, but it seems like it doesn't have a vote.

Is this still the case? Just asking because you've marked it [SOLVED].

If you don't get the vote, please check connectivity between the machines.

You can run grep -RiE '(qdevice|corosync)' /var/log and it should give you all the entries containing qdevice or corosync in the logs. You can then go into the specific files (/var/log/daemon.log and /var/log/syslog will be the most interesting).

If you can't get it working, please post the output. In the log files, the lines surrounding 'corosync' or 'qdevice' entries will also be useful.
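
Another thing worth checking is whether the qnetd daemon on the Pi actually sees the cluster (the NV flag in your membership output means the qdevice is not currently providing a vote to the nodes), for example:

Code:
# on the Pi: should list the cluster and both nodes when the connection works
corosync-qnetd-tool -l
# on each node: the qdevice daemon has to be running as well
systemctl status corosync-qdevice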
 
I did some more troubleshooting. The logs showed me errors saying "Network address not available", even though the host was obviously up.
I eventually realized that the corosync-qnetd service was not running on the Pi and would not start.

I ended up removing the qdevice, purging all corosync packages from the Pi (again), reinstalling them, and adding the qdevice back. Now it looks like it should: the qdevice shows up in pvecm status and has a vote.
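
In case someone else ends up here, the sequence that finally worked for me was roughly this (on my Pi the installed packages were corosync-qnetd and corosync-qdevice):

Code:
# on one cluster node: remove the qdevice from the cluster
pvecm qdevice remove
# on the Pi: check the service, then purge and reinstall the packages
systemctl status corosync-qnetd
apt purge corosync-qnetd corosync-qdevice
apt install corosync-qnetd corosync-qdevice
# back on the node: add the qdevice again
pvecm qdevice setup <IP-of-the-Pi>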

Thanks for all the help with this troublesome issue.
 
Great, you're welcome!
 
