[SOLVED] 2-node cluster with qdevice - quorum lost when one of the 2 nodes is down

breakaway9000

Hi everyone

I just set this up a few days ago. Setup went very smoothly (no errors).

But when testing, I run into an issue when I reboot one of my two nodes. Datacenter -> HA shows something like "No quorum on node1" (where node1 is the node that is still online). From what I understand it should say "Quorum OK", right?

I have a main LAN (10.1.10.0/24), which is only 1 Gbps, and a dedicated 10 Gbps network (10.2.10.0/24) for the cluster to use. My qdevice has NICs on both networks, and I have confirmed that my two Proxmox nodes and my qdevice can all ping each other on both NICs.

Can someone please take a look at the output of pvecm nodes and pvecm status below and see if I have done anything obviously wrong?

Here is how it looks when things are normal:

Code:
# pvecm nodes
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1   A,NV,NMW node2
         2          1   A,NV,NMW node1 (local)
         0          0            Qdevice (votes 1)

# pvecm status
Cluster information
-------------------
Name:             clu01
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Sep 20 13:06:28 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.59
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   A,NV,NMW 10.2.10.131
0x00000002          1   A,NV,NMW 10.2.10.132 (local)
0x00000000          0            Qdevice (votes 1)

And this is when node2 is offline (say due to a reboot or something):

Code:
# pvecm nodes

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         2          1   A,NV,NMW node1 (local)
         0          0            Qdevice (votes 1)

# pvecm status
Cluster information
-------------------
Name:             clu01
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Sep 20 13:04:50 2025
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.54
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:            Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000002          1   A,NV,NMW 10.2.10.132 (local)
0x00000000          0            Qdevice (votes 1)
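
For anyone reading along, the numbers above can be checked by hand. This is only a sketch, assuming votequorum uses the usual majority rule (more than half of the expected votes, i.e. floor(expected / 2) + 1), which matches the "Quorum: 2" shown for 3 expected votes. The function names are made up for illustration.

```shell
#!/bin/sh
# Majority threshold, assuming the rule floor(expected / 2) + 1.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Prints "yes" if the total votes reach the threshold, otherwise "no".
# Arguments: expected votes, total votes currently present.
is_quorate() {
    if [ "$2" -ge "$(quorum_threshold "$1")" ]; then
        echo yes
    else
        echo no
    fi
}

# Values taken from the two pvecm status outputs above.
echo "threshold for 3 expected votes:             $(quorum_threshold 3)"
echo "both nodes up, qdevice not voting (2 votes): $(is_quorate 3 2)"
echo "one node up, qdevice not voting (1 vote):    $(is_quorate 3 1)"
```

So with the qdevice contributing 0 votes, a single remaining node only has 1 of the 2 votes it needs, which is why activity gets blocked.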

OK, some progress: on my qdevice, the logs are full of this error:

Code:
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate
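
To get a feel for how often this happens, the matching lines can be counted. A minimal sketch, assuming the qnetd messages end up in the systemd journal under the corosync-qnetd unit; the helper name count_cert_errors is made up, and it reads log text on stdin so it also works on a saved log file.

```shell
#!/bin/sh
# Count log lines that report the certificate verification failure.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean while still printing the count.
count_cert_errors() {
    grep -c 'SSL peer cannot verify your certificate' || true
}

# Small demo using the line from the log above plus an unrelated line.
err='corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.'
printf '%s\n' "$err" "$err" 'some unrelated line' | count_cert_errors

# On the qdevice itself (assumption: logs are in the journal):
#   journalctl -u corosync-qnetd --no-pager | count_cert_errors
```

It may also be worth running corosync-qnetd-tool -l on the qdevice to see which cluster nodes are actually connected, and corosync-qdevice-tool -s on each node for the client side.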

So I tried removing & re-adding the device:

Code:
# pvecm qdevice remove
Synchronizing state of corosync-qdevice.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install disable corosync-qdevice
Removed '/etc/systemd/system/multi-user.target.wants/corosync-qdevice.service'.
Synchronizing state of corosync-qdevice.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install disable corosync-qdevice
Removed '/etc/systemd/system/multi-user.target.wants/corosync-qdevice.service'.
Reloading corosync.conf...
Done

Code:
# pvecm qdevice setup 10.2.10.120
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@10.2.10.120's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh -i /root/.ssh/id_rsa 'root@10.2.10.120'"
and check to make sure that only the key(s) you wanted were added.


INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db

INFO: copying CA cert and initializing on all nodes

node 'node1': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'node1': Creating new key and cert db
node 'node1': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'node1': Importing CA
node 'node2': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'node2': Creating new key and cert db
node 'node2': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'node2': Importing CA
INFO: generating cert request
Creating new certificate request


Generating key.  This may take a few moments...

Certificate request stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.crq

INFO: copying exported cert request to qnetd server

INFO: sign and export cluster cert
Signing cluster certificate
Certificate stored in /etc/corosync/qnetd/nssdb/cluster-clu01.crt

INFO: copy exported CRT

INFO: import certificate
Importing signed cluster certificate
Notice: Trust flag u is set automatically if the private key is present.
pk12util: PKCS12 EXPORT SUCCESSFUL
Certificate stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12

INFO: copy and import pk12 cert to all nodes

node 'node1': Importing cluster certificate and key
node 'node1': pk12util: PKCS12 IMPORT SUCCESSFUL
node 'node2': Importing cluster certificate and key
node 'node2': pk12util: PKCS12 IMPORT SUCCESSFUL
INFO: add QDevice to cluster configuration

INFO: start and enable corosync qdevice daemon on node 'node1'...
Synchronizing state of corosync-qdevice.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install enable corosync-qdevice
Created symlink '/etc/systemd/system/multi-user.target.wants/corosync-qdevice.service' -> '/usr/lib/systemd/system/corosync-qdevice.service'.

INFO: start and enable corosync qdevice daemon on node 'node2'...
Synchronizing state of corosync-qdevice.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install enable corosync-qdevice
Created symlink '/etc/systemd/system/multi-user.target.wants/corosync-qdevice.service' -> '/usr/lib/systemd/system/corosync-qdevice.service'.
Reloading corosync.conf...
Done

As you can see, re-adding the qdevice reports no errors, but the same errors persist in the qdevice logs:

Code:
corosync-qnetd[878]: Unhandled error when reading from client. Disconnecting client (-12271): SSL peer cannot verify your certificate.

OK, finally resolved this. I ran this on my qdevice:

Code:
apt-get remove --purge corosync-qnetd corosync-qdevice corosync

(the `--purge` option also removes the packages' config files)

Then re-installed:

Code:
apt install corosync-qnetd -y && apt install corosync-qdevice -y

Now quorum is retained when one node goes offline for maintenance or whatever, and the qdevice is able to cast a vote (note the "1" in the Votes column for the Qdevice, where previously it was 0, and the node flags changing from NV to V):

Code:
# pvecm nodes

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW node1 (local)
         2          1    A,V,NMW node2
         0          1            Qdevice
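
If you want to check this from a script rather than by eye, the Qdevice row can be pulled out of the pvecm nodes output. A rough sketch only: the helper name qdevice_votes is made up, and it assumes the column layout shown in the outputs in this post (Nodeid, Votes, then the name).

```shell
#!/bin/sh
# Print the Votes column of the Qdevice row.
# Reads pvecm nodes (or pvecm status) output on stdin.
qdevice_votes() {
    awk '$3 == "Qdevice" { print $2 }'
}

# Demo with the Qdevice rows from before and after the fix.
printf '         0          0            Qdevice (votes 1)\n' | qdevice_votes
printf '         0          1            Qdevice\n' | qdevice_votes

# On a cluster node:
#   pvecm nodes | qdevice_votes
```

A result of 0 means the qdevice is configured but not currently contributing its vote, which was the broken state here.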

And the corosync-qnetd service on the qdevice is running and healthy:

Code:
# systemctl status corosync-qnetd
● corosync-qnetd.service - Corosync Qdevice Network daemon
     Loaded: loaded (/lib/systemd/system/corosync-qnetd.service; enabled; preset: enabled)
     Active: active (running) since Sat 2025-09-20 16:13:45 NZST; 3min 31s ago
       Docs: man:corosync-qnetd
   Main PID: 856 (corosync-qnetd)
      Tasks: 1 (limit: 4646)
     Memory: 6.6M
        CPU: 56ms
     CGroup: /system.slice/corosync-qnetd.service
             └─856 /usr/bin/corosync-qnetd -f

Sep 20 16:13:45 carnelian systemd[1]: Starting corosync-qnetd.service - Corosync Qdevice Network daemon...
Sep 20 16:13:45 carnelian systemd[1]: Started corosync-qnetd.service - Corosync Qdevice Network daemon.

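I'm guessing something I did while originally messing around with this broke the config files on the qdevice, and re-adding the qdevice does not overwrite them? The "Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db" line in the setup output above seems to point that way.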
 