corosync-qnetd.service does not start

Stephan Becker

New Member
Feb 17, 2023
Hi all,

I have a four-node cluster up and running.
Now I tried to add a qdevice that runs inside a Docker container on a Synology NAS.

The containerized qdevice was successfully added to the cluster but gives no vote.
I can ssh into the device from any node and vice versa without any certificate issues.
So the problem seems not to be with the cluster.

Code:
root@pveNode0:~# pvecm status
Cluster information
-------------------
Name:             BSB-Datacenter
Config Version:   31
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Feb 17 01:48:17 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.335
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   A,NV,NMW x.x.x.20 (local)
0x00000002          1   A,NV,NMW x.x.x.21
0x00000003          1   A,NV,NMW x.x.x.22
0x00000004          1   A,NV,NMW x.x.x.23
0x00000000          0            Qdevice (votes 1)
root@pveNode0:~#
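
The NV flag in the Qdevice column indicates that no vote is being provided to the nodes. For a node-side check of the qdevice connection, something like this should help (assuming the corosync-qdevice package, which ships the tool on Debian/Proxmox, is installed on the nodes):

Code:
# on any cluster node: show the qdevice/qnetd connection status, verbose
corosync-qdevice-tool -sv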

The corosync configs are in sync across all nodes.

Code:
root@pveNode0:~# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pveNode0
    nodeid: 1
    quorum_votes: 1
    ring0_addr: x.x.x.20
  }
  node {
    name: pveNode1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: x.x.x.21
  }
  node {
    name: pveNode2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: x.x.x.22
  }
  node {
    name: pveNode3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: x.x.x.23
  }
}

quorum {
  device {
    model: net
    net {
      algorithm: ffsplit
      host: x.x.x.30
      tls: on
    }
    votes: 1
  }
  provider: corosync_votequorum
}

totem {
  cluster_name: BSB-Datacenter
  config_version: 31
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
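
To double-check that the configs really are identical on every node, a simple checksum comparison does the job (just a sanity check, run on each node):

Code:
# compare on all nodes; the hashes must match everywhere
sha256sum /etc/corosync/corosync.conf /etc/pve/corosync.conf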

After a bit of investigation I found that the qnetd service in the qdevice container does not start:
Code:
root@qdevice:~# systemctl start corosync-qnetd
Job for corosync-qnetd.service failed because the control process exited with error code.
See "systemctl status corosync-qnetd.service" and "journalctl -xe" for details.

root@qdevice:~# systemctl status corosync-qnetd
● corosync-qnetd.service - Corosync Qdevice Network daemon
     Loaded: loaded (/lib/systemd/system/corosync-qnetd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-02-17 00:40:04 UTC; 11s ago
       Docs: man:corosync-qnetd
    Process: 59 ExecStart=/usr/bin/corosync-qnetd -f $COROSYNC_QNETD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 59 (code=exited, status=1/FAILURE)

root@qdevice:~# journalctl -xe
-- No entries --
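
Since the journal is empty, running the daemon manually in the foreground with debug output is probably the quickest way to see why it exits (-f and -d are the standard foreground and debug switches of corosync-qnetd, so errors should land directly on the terminal):

Code:
# run qnetd in the foreground with debug logging
/usr/bin/corosync-qnetd -f -d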

The Dockerfile the container is built from contains:

Code:
ARG  TAG=latest
FROM debian:${TAG}
RUN echo 'debconf debconf/frontend select teletype' | debconf-set-selections
RUN apt-get update
RUN apt-get dist-upgrade -y
RUN apt-get install -y --no-install-recommends \
        systemd        \
        systemd-sysv   \
        cron           \
        anacron           \
        corosync-qnetd \
        openssh-server \
        mc

RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
RUN echo 'root:i229l1NF!proxmox' | chpasswd
RUN chown -R coroqnetd:coroqnetd /etc/corosync/

RUN apt-get clean
RUN rm -rf                        \
    /var/lib/apt/lists/*          \
    /var/log/alternatives.log     \
    /var/log/apt/history.log      \
    /var/log/apt/term.log         \
    /var/log/dpkg.log

RUN systemctl mask --   \
    dev-hugepages.mount \
    sys-fs-fuse-connections.mount

RUN rm -f           \
    /etc/machine-id \
    /var/lib/dbus/machine-id

FROM debian:${TAG}
COPY --from=0 / /
ENV container docker
STOPSIGNAL SIGRTMIN+3
VOLUME [ "/sys/fs/cgroup", "/run", "/run/lock", "/tmp" ]
CMD [ "/sbin/init" ]

And the docker compose file is this one:

YAML:
version: "3.5"
services:
  qdevice:
    container_name: qdevice
    image: 'bsb/qdevice'
    build:
      context: ./context
      dockerfile: ./Dockerfile
    hostname: qdevice
    restart: unless-stopped
    volumes:
     - /volume1/docker/qnetd/corosync-data:/etc/corosync
     - /sys/fs/cgroup:/sys/fs/cgroup:ro
    ports:
      - '22:22'
      - '5403-5412:5403-5412/udp'
    networks:
     - qdevice-net

networks:
  qdevice-net:
    name: qdevice-net
    driver: bridge

My assumption now is that something is wrong with my container build or compose.yml.
Any idea why the qnetd service is not starting at all?

Cheers
Stephan
 
What is the journal output of the corosync-qnetd service in your container?

journalctl -u corosync-qnetd > output.txt
 
There are no entries...

Code:
Linux qdevice 3.10.108 #42962 SMP Tue Oct 18 15:05:36 CST 2022 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

root@qdevice:~# ps -elf
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  2  80   0 -  4445 SyS_ep 09:09 ?        00:00:00 /sbin/init
0 S root        27     1  0  80   0 -  3290 SyS_ep 09:09 ?        00:00:00 /lib/systemd/systemd-networkd-wait-online
4 S systemd+    30     1  0  80   0 -  5298 SyS_ep 09:09 ?        00:00:00 /lib/systemd/systemd-resolved
0 S root        32     1  0  80   0 -   685 sigsus 09:09 ?        00:00:00 /usr/sbin/anacron -d -q -s
4 S root        33     1  0  80   0 -  1405 -      09:09 ?        00:00:00 /usr/sbin/cron -f
4 R root        35     1  1  80   0 -  3535 -      09:09 ?        00:00:00 sshd: root@pts/0
4 S root        38     1  3  80   0 -  3449 core_s 09:09 ?        00:00:00 sshd: root@notty
4 S root        43    35  0  80   0 -  1003 do_wai 09:09 pts/0    00:00:00 -bash
4 S root        50    38  0  80   0 -  1472 core_s 09:09 ?        00:00:00 /usr/lib/openssh/sftp-server
0 R root        55    43  0  80   0 -  1685 -      09:09 pts/0    00:00:00 ps -elf

root@qdevice:~# systemctl start corosync-qnetd
Job for corosync-qnetd.service failed because the control process exited with error code.
See "systemctl status corosync-qnetd.service" and "journalctl -xe" for details.

root@qdevice:~# systemctl status corosync-qnetd.service
● corosync-qnetd.service - Corosync Qdevice Network daemon
     Loaded: loaded (/lib/systemd/system/corosync-qnetd.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-02-17 09:11:21 UTC; 2min 0s ago
       Docs: man:corosync-qnetd
    Process: 60 ExecStart=/usr/bin/corosync-qnetd -f $COROSYNC_QNETD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 60 (code=exited, status=1/FAILURE)

root@qdevice:~# journalctl -u corosync-qnetd > output.txt
root@qdevice:~# cat output.txt
-- No entries --

Since there is more or less nothing logged, it looks like a general launch issue with this daemon.
But I have no clue what that might be.
Later today I'll try to run the service natively and dockerized on an RPi4B to see if that works.
If so, the problem is with the container itself or with the container running on the Synology NAS.

Cheers
Stephan
 
Update:
The problem with launching qnetd was the user the service is supposed to run as, or rather, that user's access rights to the folders qnetd uses.
I changed the run user in the service definition from coroqnetd to root ...

Code:
root@qdevice:/lib/systemd/system# cat corosync-qnetd.service
[Unit]
Description=Corosync Qdevice Network daemon
Documentation=man:corosync-qnetd
ConditionKernelCommandLine=!nocluster
Requires=network-online.target
After=network-online.target

[Service]
EnvironmentFile=-/etc/default/corosync-qnetd
ExecStart=/usr/bin/corosync-qnetd -f $COROSYNC_QNETD_OPTIONS
Type=notify
StandardError=null
Restart=on-abnormal
#User=coroqnetd
User=root
RuntimeDirectory=corosync-qnetd
RuntimeDirectoryMode=0770
PrivateTmp=yes

[Install]
WantedBy=multi-user.target
root@qdevice:/lib/systemd/system#
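
After editing the unit file, systemd has to pick up the change before the restart:

Code:
# reload unit definitions and restart the service
systemctl daemon-reload
systemctl restart corosync-qnetd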

Now the service starts up properly ...

Code:
root@qdevice:~# systemctl status corosync-qnetd
● corosync-qnetd.service - Corosync Qdevice Network daemon
     Loaded: loaded (/lib/systemd/system/corosync-qnetd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-02-17 09:34:04 UTC; 1min 59s ago
       Docs: man:corosync-qnetd
   Main PID: 226 (corosync-qnetd)
     Memory: 8.3M
     CGroup: /docker/a1db405358c3ab6c14974e52a35db9d5ffa86f6c81eb820669c4ebb37b272f76/system.slice/corosync-qnetd.service
             └─226 /usr/bin/corosync-qnetd -f

But still no votes are reported to the cluster....

Code:
root@pveNode0:~# pvecm status
Cluster information
-------------------
Name:             BSB-Datacenter
Config Version:   33
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Fri Feb 17 10:35:47 2023
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1.335
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      4
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   A,NV,NMW x.x.x.20 (local)
0x00000002          1   A,NV,NMW x.x.x.21
0x00000003          1   A,NV,NMW x.x.x.22
0x00000004          1   A,NV,NMW x.x.x.23
0x00000000          0            Qdevice (votes 1)
root@pveNode0:~#
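
As a cross-check, the qnetd side can list which clusters are actually connected; on the container this should do it (the tool comes with the corosync-qnetd package):

Code:
# on the qnetd container: list connected clusters and their vote information
corosync-qnetd-tool -l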

One step further, but still not there yet....
Any ideas what to check next?

Cheers
Stephan
 
I assume you followed the steps outlined in our wiki? [1] Make sure to double-check whether you might have overlooked something there. Particularly with regard to installing the corosync-qdevice package on all cluster nodes.

The service should be started when adding the QDevice to the cluster via pvecm, without any need for manual starting. So maybe you could try stopping the service, removing the QDevice from the cluster and then adding it again after making sure that you didn't miss anything in the wiki description?

[1] https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
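
For reference, the relevant steps from the wiki boil down to roughly this (adapt the address to your QDevice):

Code:
# on all cluster nodes
apt install corosync-qdevice

# on the external QDevice host
apt install corosync-qnetd

# then, from one cluster node: remove the current registration and set it up again
pvecm qdevice remove
pvecm qdevice setup <QDEVICE-IP>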
 
One additional thing I noticed: Since /etc/corosync is a mountpoint - does this folder have the proper permissions? You could try running corosync-qnetd manually via CLI and then you should get log output. Maybe there is some interesting information there.
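
Also keep in mind that the chown in your Dockerfile only affects the image layer; with the bind mount in place, the ownership of /volume1/docker/qnetd/corosync-data on the Synology side is what the daemon actually sees. Something along these lines inside the container should show whether the coroqnetd user can reach its data (the nssdb path is the usual Debian default, adjust if yours differs):

Code:
# check ownership/permissions of the mounted config and the qnetd NSS database
ls -ldn /etc/corosync /etc/corosync/qnetd /etc/corosync/qnetd/nssdb

# if needed, hand the tree back to the service user instead of running qnetd as root
chown -R coroqnetd:coroqnetd /etc/corosync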
 
One additional thing I noticed: Since /etc/corosync is a mountpoint - does this folder have the proper permissions? You could try running corosync-qnetd manually via CLI and then you should get log output. Maybe there is some interesting information there.

I tried that.
No log output at all.
The service is running - all green, but no votes.
Running as root, there should be no access privilege issues anyway, right?

Intermediate Results:

As promised, I installed the qdevice natively on an RPi4 (as a reference installation), and with that everything worked!
Though there were a lot of certificate issues and such, as usual...

At the end I was able to add/remove/add the new qdevice without any error messages.
And yes, it now provides the quorum vote I was looking for.

That gives me some confidence that my 4-node cluster is OK and that the NAS container installation is the root cause of the problem described above.
And yes, the permissions of the mount point and those of some other folders need to be checked.
With a working cluster and a working RPi4 qdevice I'll dig into that next.

First, I need to make sure that the new reference qdevice and the cluster are fine.
Maybe somebody can tell me if the following behaviour is intended.

With the qdevice set up I did a shutdown on pveNode2 and pveNode3 (#3 and #4 of those 4).
The status of the cluster manager after that was as follows:

Code:
Quorum information
------------------
Date:             Sat Feb 18 00:25:14 2023
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.391
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      3
Quorum:           3 
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW x.x.x.20 (local)
0x00000002          1    A,V,NMW x.x.x.21
0x00000000          1            Qdevice

Triggering the shutdown of another node (3 of 4 down now) initiated the migration of all the services from pveNode1 to pveNode0, as expected.
Right after that, pveNode0 did a reboot as well, and after it came up again none of the VMs and LXCs started.

Looking at the CM status I got the following:

Code:
Quorum information
------------------
Date:             Sat Feb 18 00:29:52 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.395
Quorate:          No

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      1
Quorum:           3 Activity blocked
Flags:            Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1   A,NV,NMW 192.168.40.20 (local)
0x00000000          0            Qdevice (votes 1)

Quorum is gone with 3 of 4 nodes down, and there are no votes from the qdevice anymore.
This behaviour aligns perfectly with the wiki's description of a cluster with an even number of nodes plus a qdevice.
So far so good.
But what about that reboot of the last node after quorum loss?
I did not find anything about that in the docs...
Is that some kind of safety measure (fencing) to halt all guests on a node that has become isolated from the cluster?
Just pulling the network cable of a node would lead to the same behaviour on that node, right?
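
Next I'll look at the HA manager and its logs on pveNode0 to see whether the watchdog fenced the node; something like this should show it:

Code:
# HA resource and node state
ha-manager status

# HA service logs since the last boot
journalctl -b -u pve-ha-lrm -u pve-ha-crm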

Cheers
Stephan
 
YAML:
version: "3.5"
services:
  qdevice:
    container_name: qdevice
    image: 'bsb/qdevice'
    build:
      context: ./context
      dockerfile: ./Dockerfile
    hostname: qdevice
    restart: unless-stopped
    volumes:
     - /volume1/docker/qnetd/corosync-data:/etc/corosync
     - /sys/fs/cgroup:/sys/fs/cgroup:ro
    ports:
      - '22:22'
      - '5403-5412:5403-5412/udp'
    networks:
     - qdevice-net

networks:
  qdevice-net:
    name: qdevice-net
    driver: bridge

My assumption now is that something is wrong with my container build or compose.yml.
Any idea why the qnetd service is not starting at all?

Port 5403 is TCP, unlike all the others.
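
So the compose file forwards 5403 only as part of the UDP range, while corosync-qnetd listens on TCP 5403. Once the mapping includes 5403/tcp, reachability can be verified from a cluster node with plain bash (no extra tools needed):

Code:
# succeeds only if TCP 5403 on the qnetd host is reachable
timeout 3 bash -c '</dev/tcp/x.x.x.30/5403' && echo "port open" || echo "port closed or filtered"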
 
