[SOLVED] corosync-qdevice.service fails to start with 'received server error 18. Disconnecting from server'

Nov 11, 2020
11
1
3
33
I have a problem getting a QDevice to work on proxmox 6.2-12

First I install the QDevice package on the 3rd witness (Raspberry Pi OS 20-08-2020) box:
Code:
# apt install corosync-qnetd
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  corosync-qnetd
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 49.3 kB of archives.
After this operation, 153 kB of additional disk space will be used.
Get:1 http://mirror.init7.net/raspbian/raspbian buster/main armhf corosync-qnetd armhf 3.0.0-4+deb10u1 [49.3 kB]
Fetched 49.3 kB in 1s (85.5 kB/s)   
Selecting previously unselected package corosync-qnetd.
(Reading database ... 40387 files and directories currently installed.)
Preparing to unpack .../corosync-qnetd_3.0.0-4+deb10u1_armhf.deb ...
Unpacking corosync-qnetd (3.0.0-4+deb10u1) ...
Setting up corosync-qnetd (3.0.0-4+deb10u1) ...
Creating /etc/corosync/qnetd/nssdb
Creating new key and cert db
password file contains no data
Creating new noise file /etc/corosync/qnetd/nssdb/noise.txt
Creating new CA


Generating key.  This may take a few moments...

Is this a CA certificate [y/N]?
Enter the path length constraint, enter to skip [<0 for unlimited path]: > Is this a critical extension [y/N]?


Generating key.  This may take a few moments...

Notice: Trust flag u is set automatically if the private key is present.
QNetd CA certificate is exported as /etc/corosync/qnetd/nssdb/qnetd-cacert.crt
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qnetd.service ? /lib/systemd/system/corosync-qnetd.service.
Processing triggers for man-db (2.8.5-2) ...
Processing triggers for systemd (241-7~deb10u4+rpi1) ...

I install the packages on the cluster machines as per the instructions and get this nice looking output on the cluster node:

Code:
# pvecm qdevice setup 10.192.3.243
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed

/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
        (if you think this is a mistake, you may want to use -f option)


INFO: initializing qnetd server
Certificate database (/etc/corosync/qnetd/nssdb) already exists. Delete it to initialize new db

INFO: copying CA cert and initializing on all nodes

node 'srv-pve-01': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'srv-pve-01': Creating new key and cert db
node 'srv-pve-01': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'srv-pve-01': Importing CA
node 'srv-pve-02': Creating /etc/corosync/qdevice/net/nssdb
password file contains no data
node 'srv-pve-02': Creating new key and cert db
node 'srv-pve-02': Creating new noise file /etc/corosync/qdevice/net/nssdb/noise.txt
node 'srv-pve-02': Importing CA
INFO: generating cert request
Creating new certificate request


Generating key.  This may take a few moments...

Certificate request stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.crq

INFO: copying exported cert request to qnetd server

INFO: sign and export cluster cert
Signing cluster certificate
Certificate stored in /etc/corosync/qnetd/nssdb/cluster-modal.crt

INFO: copy exported CRT

INFO: import certificate
Importing signed cluster certificate
Notice: Trust flag u is set automatically if the private key is present.
pk12util: PKCS12 EXPORT SUCCESSFUL
Certificate stored in /etc/corosync/qdevice/net/nssdb/qdevice-net-node.p12

INFO: copy and import pk12 cert to all nodes

node 'srv-pve-01': Importing cluster certificate and key
node 'srv-pve-01': pk12util: PKCS12 IMPORT SUCCESSFUL
node 'srv-pve-02': Importing cluster certificate and key
node 'srv-pve-02': pk12util: PKCS12 IMPORT SUCCESSFUL
INFO: add QDevice to cluster configuration

INFO: start and enable corosync qdevice daemon on node 'srv-pve-01'...
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qdevice.service -> /lib/systemd/system/corosync-qdevice.service.

INFO: start and enable corosync qdevice daemon on node 'srv-pve-02'...
Created symlink /etc/systemd/system/multi-user.target.wants/corosync-qdevice.service -> /lib/systemd/system/corosync-qdevice.service.
Reloading corosync.conf...
Done

Here we run into the problem:

Code:
service corosync-qdevice status
● corosync-qdevice.service - Corosync Qdevice daemon
   Loaded: loaded (/lib/systemd/system/corosync-qdevice.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2020-11-11 20:32:44 CET; 29s ago
     Docs: man:corosync-qdevice
  Process: 25363 ExecStart=/usr/sbin/corosync-qdevice -f $COROSYNC_QDEVICE_OPTIONS (code=exited, status=2)
 Main PID: 25363 (code=exited, status=2)

Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Service RestartSec=100ms expired, scheduling restart.
Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Scheduled restart job, restart counter is at 6.
Nov 11 20:32:44 srv-pve-02 systemd[1]: Stopped Corosync Qdevice daemon.
Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Start request repeated too quickly.
Nov 11 20:32:44 srv-pve-02 systemd[1]: corosync-qdevice.service: Failed with result 'exit-code'.
Nov 11 20:32:44 srv-pve-02 systemd[1]: Failed to start Corosync Qdevice daemon.

And the error in the if I run the command manually in debug mode is 'error 18 disconnected from server.'

Code:
# /usr/sbin/corosync-qdevice -f -d
Nov 11 20:39:45 debug   Initializing votequorum
Nov 11 20:39:45 debug   Initializing local socket
Nov 11 20:39:45 debug   Registering qdevice models
Nov 11 20:39:45 debug   Configuring qdevice
Nov 11 20:39:45 debug   Configuring master_wins
Nov 11 20:39:45 debug   Getting configuration node list
Nov 11 20:39:45 debug   Initializing qdevice model
Nov 11 20:39:45 debug   Initializing qdevice_net_instance
Nov 11 20:39:45 debug   Registering algorithms
Nov 11 20:39:45 debug   Initializing NSS
Nov 11 20:39:45 debug   Cast vote timer remains stopped.
Nov 11 20:39:45 debug   Initializing cmap tracking
Nov 11 20:39:45 debug   Waiting for ring id
Nov 11 20:39:45 debug   Votequorum nodelist notify callback:
Nov 11 20:39:45 debug     Ring_id = (1.12)
Nov 11 20:39:45 debug     Node list (size = 2):
Nov 11 20:39:45 debug       0 nodeid = 1
Nov 11 20:39:45 debug       1 nodeid = 2
Nov 11 20:39:45 debug   Algorithm decided to pause cast vote timer and result vote is No change
Nov 11 20:39:45 debug   Cast vote timer is now paused.
Nov 11 20:39:45 debug   Votequorum quorum notify callback:
Nov 11 20:39:45 debug     Quorate = 1
Nov 11 20:39:45 debug     Node list (size = 3):
Nov 11 20:39:45 debug       0 nodeid = 1, state = 1
Nov 11 20:39:45 debug       1 nodeid = 2, state = 1
Nov 11 20:39:45 debug       2 nodeid = 0, state = 0
Nov 11 20:39:45 debug   Algorithm decided to not send list and result vote is No change
Nov 11 20:39:45 debug   Waiting for initial heuristics exec result
Nov 11 20:39:45 debug   worker: Heuristic worker initialized
Nov 11 20:39:45 debug   worker: qdevice_heuristics_worker_cmd_process_one_line: Received exec-list-clear command
Nov 11 20:39:45 debug   worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "1" and timeout "15000"
Nov 11 20:39:45 debug   Received heuristics exec result command with seq_no "1" and result "Disabled"
Nov 11 20:39:45 debug   Votequorum heuristics exec result callback:
Nov 11 20:39:45 debug     seq_number = 1, exec_result = Disabled
Nov 11 20:39:45 debug   Algorithm decided to not send list, result vote is No change and heuristics is Undefined
Nov 11 20:39:45 debug   Cast vote timer is no longer paused.
Nov 11 20:39:45 debug   Not scheduling heuristics timer because mode is not enabled
Nov 11 20:39:45 debug   Running qdevice model
Nov 11 20:39:45 debug   Executing qdevice-net
Nov 11 20:39:45 debug   Trying connect to qnetd server 10.192.3.243:5403 (timeout = 8000ms)
Nov 11 20:39:45 debug   Sending preinit msg to qnetd
Nov 11 20:39:45 debug   Received preinit reply msg
Nov 11 20:39:45 debug   Sending client auth data.
Nov 11 20:39:45 debug   Received init reply msg
Nov 11 20:39:45 debug   Scheduling send of heartbeat every 8000ms
Nov 11 20:39:45 debug   Executing after-connect heuristics.
Nov 11 20:39:45 debug   worker: qdevice_heuristics_worker_cmd_process_exec: Received exec command with seq_no "2" and timeout "5000"
Nov 11 20:39:45 debug   Received heuristics exec result command with seq_no "2" and result "Disabled"
Nov 11 20:39:45 debug   Algorithm decided to send config node list, send membership node list, send quorum node list, heuristics is Undefined and result vote is Wait for reply
Nov 11 20:39:45 debug   Sending config node list seq = 4
Nov 11 20:39:45 debug     Node list:
Nov 11 20:39:45 debug   Sending membership node list seq = 5, ringid = (1.12), heuristics = Undefined.
Nov 11 20:39:45 debug     Node list:
Nov 11 20:39:45 debug       0 node_id = 1, data_center_id = 0, node_state = not set
Nov 11 20:39:45 debug       1 node_id = 2, data_center_id = 0, node_state = not set
Nov 11 20:39:45 debug   Sending quorum node list seq = 6, quorate = 1
Nov 11 20:39:45 debug     Node list:
Nov 11 20:39:45 debug       0 node_id = 1, data_center_id = 0, node_state = member
Nov 11 20:39:45 debug       1 node_id = 2, data_center_id = 0, node_state = member
Nov 11 20:39:45 debug   Cast vote timer remains stopped.
Nov 11 20:39:45 error   Received server error 18. Disconnecting from server
Nov 11 20:39:45 debug   Algorithm result vote is NACK
Nov 11 20:39:45 debug   Cast vote timer is now scheduled every 5000ms voting NACK.
Nov 11 20:39:45 debug   Removing cmap tracking
Nov 11 20:39:45 debug   Destroying qdevice model
Nov 11 20:39:45 debug   Destroying algorithm
Nov 11 20:39:45 debug   Destroying qdevice_net_instance
Nov 11 20:39:45 debug   Shutting down NSS
Nov 11 20:39:45 debug   Destroying qdevice ipc
Nov 11 20:39:45 debug   Destroying votequorum and cmap
Nov 11 20:39:45 debug   Destroying heuristics
Nov 11 20:39:45 debug   Waiting for heuristics worker to finish
Nov 11 20:39:45 debug   worker: Heuristic worker shutdown requested
Nov 11 20:39:45 debug   worker: Waiting for all processes to exit
Nov 11 20:39:45 debug   Closing log

Seems the service wont start for some reason, is this a bug? I cant find any communication issues between the nodes and Q but I am open to suggestions where to look next, everything seems ok apart from the service wont start on any of the nodes.
 
Tried a few more things this morning but none bore fruit:
- Check ports 5403 open on all nodes+qdevice
- Telnet from each node to 5403 on QDevice
- reboot QDevice and confirm telnet again

Seems network is OK - is there a chance the qdevice version in Debian 10 repos and the one in Proxmox 6.2 are not compatible?
 
18 would be INVALID_CONFIG_NODE_LIST, which the qnetd daemon returns if it receives an empty nodelist from the client, or one without the client (this should be visible on the qnetd side in the log?). could you post your corosync.conf?
 
Hi there sure, I just saw this as well which is exactly along the lines you mentioned: https://www.mail-archive.com/users@clusterlabs.org/msg07326.html

Code:
# cat /etc/corosync/corosync.conf 
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: srv-pve-01
    nodeid: 2
    quorum_votes: 1
    ring5_addr: 10.192.3.45
  }
  node {
    name: srv-pve-02
    nodeid: 1
    quorum_votes: 1
    ring5_addr: 10.192.3.46
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: 10.192.3.243
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: modal
  config_version: 11
  interface {
    linknumber: 5
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
Just for completeness here is the debug log from the qnetd on qdevice when I restart the qdevice service on one of the nodes:

To be clear it seems the issue in the mailing list above is similar but different. My nodelist in corosync.conf is above the quorum entries and does have the correct syntax so something else should be the problem or do you spot something I am missing in that config?

Code:
root@qdevice:~/.ssh# corosync-qnetd -f -d
Nov 12 12:29:48 debug   Initializing nss
Nov 12 12:29:48 debug   Initializing local socket
Nov 12 12:29:48 debug   Creating listening socket
Nov 12 12:29:48 debug   Registering algorithms
Nov 12 12:29:48 debug   QNetd ready to provide service
Nov 12 12:30:01 debug   New client connected
Nov 12 12:30:01 debug     cluster name = modal
Nov 12 12:30:01 debug     tls started = 1
Nov 12 12:30:01 debug     tls peer certificate verified = 1
Nov 12 12:30:01 debug     node_id = 1
Nov 12 12:30:01 debug     pointer = 0x18576c8
Nov 12 12:30:01 debug     addr_str = ::ffff:10.192.3.46:44332
Nov 12 12:30:01 debug     ring id = (1.12)
Nov 12 12:30:01 debug     cluster dump:
Nov 12 12:30:01 debug       client = ::ffff:10.192.3.46:44332, node_id = 1
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44332 (cluster modal, node_id 1) sent initial node list.
Nov 12 12:30:01 debug     msg seq num = 4
Nov 12 12:30:01 debug     config version = 11
Nov 12 12:30:01 debug     node list:
Nov 12 12:30:01 error   ffsplit: Received empty config node list for client ::ffff:10.192.3.46:44332
Nov 12 12:30:01 error   Algorithm returned error code. Sending error reply.
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44332 (cluster modal, node_id 1) sent membership node list.
Nov 12 12:30:01 debug     msg seq num = 5
Nov 12 12:30:01 debug     ring id = (1.12)
Nov 12 12:30:01 debug     heuristics = Undefined
Nov 12 12:30:01 debug     node list:
Nov 12 12:30:01 debug       node_id = 1, data_center_id = 0, node_state = not set
Nov 12 12:30:01 debug       node_id = 2, data_center_id = 0, node_state = not set
Nov 12 12:30:01 debug   Algorithm result vote is Ask later
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44332 (cluster modal, node_id 1) sent quorum node list.
Nov 12 12:30:01 debug     msg seq num = 6
Nov 12 12:30:01 debug     quorate = 1
Nov 12 12:30:01 debug     node list:
Nov 12 12:30:01 debug       node_id = 2, data_center_id = 0, node_state = member
Nov 12 12:30:01 debug       node_id = 1, data_center_id = 0, node_state = member
Nov 12 12:30:01 debug   Algorithm result vote is No change
Nov 12 12:30:01 debug   Client closed connection
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44332 (init_received 1, cluster modal, node_id 1) disconnect
Nov 12 12:30:01 debug   ffsplit: Membership for cluster modal is now stable
Nov 12 12:30:01 debug   ffsplit: No quorate partition was selected
Nov 12 12:30:01 debug   ffsplit: No client gets NACK
Nov 12 12:30:01 debug   ffsplit: No client gets ACK
Nov 12 12:30:01 debug   New client connected
Nov 12 12:30:01 debug     cluster name = modal
Nov 12 12:30:01 debug     tls started = 1
Nov 12 12:30:01 debug     tls peer certificate verified = 1
Nov 12 12:30:01 debug     node_id = 1
Nov 12 12:30:01 debug     pointer = 0x18576c8
Nov 12 12:30:01 debug     addr_str = ::ffff:10.192.3.46:44348
Nov 12 12:30:01 debug     ring id = (1.12)
Nov 12 12:30:01 debug     cluster dump:
Nov 12 12:30:01 debug       client = ::ffff:10.192.3.46:44348, node_id = 1
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44348 (cluster modal, node_id 1) sent initial node list.
Nov 12 12:30:01 debug     msg seq num = 4
Nov 12 12:30:01 debug     config version = 11
Nov 12 12:30:01 debug     node list:
Nov 12 12:30:01 error   ffsplit: Received empty config node list for client ::ffff:10.192.3.46:44348
Nov 12 12:30:01 error   Algorithm returned error code. Sending error reply.
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44348 (cluster modal, node_id 1) sent membership node list.
Nov 12 12:30:01 debug     msg seq num = 5
Nov 12 12:30:01 debug     ring id = (1.12)
Nov 12 12:30:01 debug     heuristics = Undefined
Nov 12 12:30:01 debug     node list:
Nov 12 12:30:01 debug       node_id = 1, data_center_id = 0, node_state = not set
Nov 12 12:30:01 debug       node_id = 2, data_center_id = 0, node_state = not set
Nov 12 12:30:01 debug   Algorithm result vote is Ask later
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44348 (cluster modal, node_id 1) sent quorum node list.
Nov 12 12:30:01 debug     msg seq num = 6
Nov 12 12:30:01 debug     quorate = 1
Nov 12 12:30:01 debug     node list:
Nov 12 12:30:01 debug       node_id = 2, data_center_id = 0, node_state = member
Nov 12 12:30:01 debug       node_id = 1, data_center_id = 0, node_state = member
Nov 12 12:30:01 debug   Algorithm result vote is No change
Nov 12 12:30:01 debug   Client closed connection
Nov 12 12:30:01 debug   Client ::ffff:10.192.3.46:44348 (init_received 1, cluster modal, node_id 1) disconnect
Nov 12 12:30:01 debug   ffsplit: Membership for cluster modal is now stable
Nov 12 12:30:01 debug   ffsplit: No quorate partition was selected
Nov 12 12:30:01 debug   ffsplit: No client gets NACK
Nov 12 12:30:01 debug   ffsplit: No client gets ACK
Nov 12 12:30:02 debug   New client connected
Nov 12 12:30:02 debug     cluster name = modal
Nov 12 12:30:02 debug     tls started = 1
Nov 12 12:30:02 debug     tls peer certificate verified = 1
Nov 12 12:30:02 debug     node_id = 1
Nov 12 12:30:02 debug     pointer = 0x18576c8
Nov 12 12:30:02 debug     addr_str = ::ffff:10.192.3.46:44354
Nov 12 12:30:02 debug     ring id = (1.12)
Nov 12 12:30:02 debug     cluster dump:
Nov 12 12:30:02 debug       client = ::ffff:10.192.3.46:44354, node_id = 1
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44354 (cluster modal, node_id 1) sent initial node list.
Nov 12 12:30:02 debug     msg seq num = 4
Nov 12 12:30:02 debug     config version = 11
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 error   ffsplit: Received empty config node list for client ::ffff:10.192.3.46:44354
Nov 12 12:30:02 error   Algorithm returned error code. Sending error reply.
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44354 (cluster modal, node_id 1) sent membership node list.
Nov 12 12:30:02 debug     msg seq num = 5
Nov 12 12:30:02 debug     ring id = (1.12)
Nov 12 12:30:02 debug     heuristics = Undefined
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 debug       node_id = 1, data_center_id = 0, node_state = not set
Nov 12 12:30:02 debug       node_id = 2, data_center_id = 0, node_state = not set
Nov 12 12:30:02 debug   Algorithm result vote is Ask later
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44354 (cluster modal, node_id 1) sent quorum node list.
Nov 12 12:30:02 debug     msg seq num = 6
Nov 12 12:30:02 debug     quorate = 1
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 debug       node_id = 2, data_center_id = 0, node_state = member
Nov 12 12:30:02 debug       node_id = 1, data_center_id = 0, node_state = member
Nov 12 12:30:02 debug   Algorithm result vote is No change
Nov 12 12:30:02 debug   Client closed connection
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44354 (init_received 1, cluster modal, node_id 1) disconnect
Nov 12 12:30:02 debug   ffsplit: Membership for cluster modal is now stable
Nov 12 12:30:02 debug   ffsplit: No quorate partition was selected
Nov 12 12:30:02 debug   ffsplit: No client gets NACK
Nov 12 12:30:02 debug   ffsplit: No client gets ACK
Nov 12 12:30:02 debug   New client connected
Nov 12 12:30:02 debug     cluster name = modal
Nov 12 12:30:02 debug     tls started = 1
Nov 12 12:30:02 debug     tls peer certificate verified = 1
Nov 12 12:30:02 debug     node_id = 1
Nov 12 12:30:02 debug     pointer = 0x18576c8
Nov 12 12:30:02 debug     addr_str = ::ffff:10.192.3.46:44366
Nov 12 12:30:02 debug     ring id = (1.12)
Nov 12 12:30:02 debug     cluster dump:
Nov 12 12:30:02 debug       client = ::ffff:10.192.3.46:44366, node_id = 1
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44366 (cluster modal, node_id 1) sent initial node list.
Nov 12 12:30:02 debug     msg seq num = 4
Nov 12 12:30:02 debug     config version = 11
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 error   ffsplit: Received empty config node list for client ::ffff:10.192.3.46:44366
Nov 12 12:30:02 error   Algorithm returned error code. Sending error reply.
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44366 (cluster modal, node_id 1) sent membership node list.
Nov 12 12:30:02 debug     msg seq num = 5
Nov 12 12:30:02 debug     ring id = (1.12)
Nov 12 12:30:02 debug     heuristics = Undefined
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 debug       node_id = 1, data_center_id = 0, node_state = not set
Nov 12 12:30:02 debug       node_id = 2, data_center_id = 0, node_state = not set
Nov 12 12:30:02 debug   Algorithm result vote is Ask later
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44366 (cluster modal, node_id 1) sent quorum node list.
Nov 12 12:30:02 debug     msg seq num = 6
Nov 12 12:30:02 debug     quorate = 1
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 debug       node_id = 2, data_center_id = 0, node_state = member
Nov 12 12:30:02 debug       node_id = 1, data_center_id = 0, node_state = member
Nov 12 12:30:02 debug   Algorithm result vote is No change
Nov 12 12:30:02 debug   Client closed connection
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44366 (init_received 1, cluster modal, node_id 1) disconnect
Nov 12 12:30:02 debug   ffsplit: Membership for cluster modal is now stable
Nov 12 12:30:02 debug   ffsplit: No quorate partition was selected
Nov 12 12:30:02 debug   ffsplit: No client gets NACK
Nov 12 12:30:02 debug   ffsplit: No client gets ACK
Nov 12 12:30:02 debug   New client connected
Nov 12 12:30:02 debug     cluster name = modal
Nov 12 12:30:02 debug     tls started = 1
Nov 12 12:30:02 debug     tls peer certificate verified = 1
Nov 12 12:30:02 debug     node_id = 1
Nov 12 12:30:02 debug     pointer = 0x18576c8
Nov 12 12:30:02 debug     addr_str = ::ffff:10.192.3.46:44370
Nov 12 12:30:02 debug     ring id = (1.12)
Nov 12 12:30:02 debug     cluster dump:
Nov 12 12:30:02 debug       client = ::ffff:10.192.3.46:44370, node_id = 1
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44370 (cluster modal, node_id 1) sent initial node list.
Nov 12 12:30:02 debug     msg seq num = 4
Nov 12 12:30:02 debug     config version = 11
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 error   ffsplit: Received empty config node list for client ::ffff:10.192.3.46:44370
Nov 12 12:30:02 error   Algorithm returned error code. Sending error reply.
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44370 (cluster modal, node_id 1) sent membership node list.
Nov 12 12:30:02 debug     msg seq num = 5
Nov 12 12:30:02 debug     ring id = (1.12)
Nov 12 12:30:02 debug     heuristics = Undefined
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 debug       node_id = 1, data_center_id = 0, node_state = not set
Nov 12 12:30:02 debug       node_id = 2, data_center_id = 0, node_state = not set
Nov 12 12:30:02 debug   Algorithm result vote is Ask later
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44370 (cluster modal, node_id 1) sent quorum node list.
Nov 12 12:30:02 debug     msg seq num = 6
Nov 12 12:30:02 debug     quorate = 1
Nov 12 12:30:02 debug     node list:
Nov 12 12:30:02 debug       node_id = 2, data_center_id = 0, node_state = member
Nov 12 12:30:02 debug       node_id = 1, data_center_id = 0, node_state = member
Nov 12 12:30:02 debug   Algorithm result vote is No change
Nov 12 12:30:02 debug   Client closed connection
Nov 12 12:30:02 debug   Client ::ffff:10.192.3.46:44370 (init_received 1, cluster modal, node_id 1) disconnect
Nov 12 12:30:02 debug   ffsplit: Membership for cluster modal is now stable
Nov 12 12:30:02 debug   ffsplit: No quorate partition was selected
Nov 12 12:30:02 debug   ffsplit: No client gets NACK
Nov 12 12:30:02 debug   ffsplit: No client gets ACK
 
Last edited:
I think I found the issue:

https://github.com/corosync/corosync-qdevice/blob/master/qdevices/qdevice-cmap.c#L121
Code:
    while ((cs_err = cmap_iter_next(cmap_handle, iter_handle, key_name, NULL, NULL)) == CS_OK) {
        res = sscanf(key_name, "nodelist.node.%u.%s", &node_pos, tmp_key);
        if (res != 2) {
            continue;
        }

        if (strcmp(tmp_key, "ring0_addr") != 0) {
            continue;
        }

        snprintf(tmp_key, CMAP_KEYNAME_MAXLEN, "nodelist.node.%u.nodeid", node_pos);

looks to my like qdevice will skip any nodelist entries that don't have a ring0_addr defined.. could you confirm by switching to ring0 instead of 5?
 
OK! Yes it works flawlessly!

For posterity, one must use ring0_addr in corosync conf and any other value will not work - at least until this bug(?) is fixed.

Thank you Fabian for your laser focus on spotting the issue so fast!
 
  • Like
Reactions: fabian
opened an issue upstream, should not be too hard to fix but will take a while to trickle down into a release + packages..
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!