Lost quorum while all nodes are online

Mephisto aD

Hello there,

Last week I updated my two-node PVE cluster, but since rebooting I have been running into some problems:

Both nodes are currently online, but keep complaining that there is no quorum.

In the web UI it looks like this:

[screenshot of the web UI showing both cluster nodes]

From the web UI of either node I still get current information about the other node; even uptime and resource stats are shown in the summary menu.

From some research, this seems to have to do with the corosync service.

So here is the corosync.conf, which is identical on both nodes:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: CALLISTO
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.2.3
  }
  node {
    name: GANYMED
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.2.10
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: JUPITER
  config_version: 3
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 3
}
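
A quick way to double-check that the config really is identical on both nodes is to compare checksums, assuming the default Proxmox paths:
Code:
# run on GANYMED and on CALLISTO, then compare the sums
md5sum /etc/pve/corosync.conf /etc/corosync/corosync.conf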

Anyway, the service must have failed for some reason:
Code:
root@GANYMED:~# corosync-cfgtool -s
Local node ID 1, transport knet
LINK ID 0
        addr    = 192.168.2.10
        status:
                nodeid:   1:    connected
                nodeid:   2:    localhost
               

root@CALLISTO:~# corosync-cfgtool -s
Could not initialize corosync configuration API error 2

And here is where things get spooky in my eyes:
If I read the corosync.conf correctly, the node GANYMED has nodeid 1, yet in the corosync-cfgtool output it appears to be connected to itself as nodeid 2 (localhost) as well?

In case that's true, the corosync service seems to have failed because of that mismatch, right?

Does anybody know how to debug this further or how to resolve this?

Thanks a lot for any kind of ideas.
Maphisto
 
Hi,
What is the output of journalctl -u corosync.service -b0, pvecm status, and pveversion -v on both nodes? Please also check your /var/log/syslog.
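
For example, run on both nodes:
Code:
journalctl -u corosync.service -b0
pvecm status
pveversion -v
tail -n 100 /var/log/syslog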
 
Hi Fabian,

Of course. Here is the output from both nodes:

Code:
root@GANYMED:~# journalctl -u corosync.service -b0
-- Logs begin at Thu 2021-06-24 13:12:13 CEST, end at Mon 2021-06-28 11:56:15 CEST. --
Jun 24 13:13:45 GANYMED systemd[1]: Starting Corosync Cluster Engine...
Jun 24 13:13:45 GANYMED corosync[1318]:   [MAIN  ] Corosync Cluster Engine 3.1.2 starting up
Jun 24 13:13:45 GANYMED corosync[1318]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jun 24 13:13:45 GANYMED corosync[1318]:   [TOTEM ] Initializing transport (Kronosnet).
Jun 24 13:13:45 GANYMED corosync[1318]:   [TOTEM ] totemknet initialized
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Jun 24 13:13:45 GANYMED corosync[1318]:   [QB    ] server name: cmap
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Jun 24 13:13:45 GANYMED corosync[1318]:   [QB    ] server name: cfg
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Jun 24 13:13:45 GANYMED corosync[1318]:   [QB    ] server name: cpg
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Jun 24 13:13:45 GANYMED corosync[1318]:   [WD    ] Watchdog not enabled by configuration
Jun 24 13:13:45 GANYMED corosync[1318]:   [WD    ] resource load_15min missing a recovery key.
Jun 24 13:13:45 GANYMED corosync[1318]:   [WD    ] resource memory_used missing a recovery key.
Jun 24 13:13:45 GANYMED corosync[1318]:   [WD    ] no resources configured.
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Jun 24 13:13:45 GANYMED corosync[1318]:   [QUORUM] Using quorum provider corosync_votequorum
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jun 24 13:13:45 GANYMED corosync[1318]:   [QB    ] server name: votequorum
Jun 24 13:13:45 GANYMED corosync[1318]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Jun 24 13:13:45 GANYMED corosync[1318]:   [QB    ] server name: quorum
Jun 24 13:13:45 GANYMED corosync[1318]:   [TOTEM ] Configuring link 0
Jun 24 13:13:45 GANYMED corosync[1318]:   [TOTEM ] Configured link number 0: local addr: 192.168.2.10, port=5405
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 2 has no active links
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 2 has no active links
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 2 has no active links
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Jun 24 13:13:45 GANYMED corosync[1318]:   [KNET  ] host: host: 1 has no active links
Jun 24 13:13:45 GANYMED corosync[1318]:   [QUORUM] Sync members[1]: 1
Jun 24 13:13:45 GANYMED corosync[1318]:   [QUORUM] Sync joined[1]: 1
Jun 24 13:13:45 GANYMED corosync[1318]:   [TOTEM ] A new membership (1.1f4) was formed. Members joined: 1
Jun 24 13:13:45 GANYMED corosync[1318]:   [QUORUM] Members[1]: 1
Jun 24 13:13:45 GANYMED corosync[1318]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 24 13:13:45 GANYMED systemd[1]: Started Corosync Cluster Engine.
Jun 24 13:35:41 GANYMED systemd[1]: Stopping Corosync Cluster Engine...
Jun 24 13:35:41 GANYMED corosync[1318]:   [CFG   ] Node 1 was shut down by sysadmin
Jun 24 13:35:41 GANYMED corosync[1318]:   [SERV  ] Unloading all Corosync service engines.
Jun 24 13:35:41 GANYMED corosync-cfgtool[4927]: Shutting down corosync
Jun 24 13:35:41 GANYMED corosync[1318]:   [QB    ] withdrawing server sockets
Jun 24 13:35:41 GANYMED corosync[1318]:   [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Jun 24 13:35:41 GANYMED corosync[1318]:   [QB    ] withdrawing server sockets
Jun 24 13:35:41 GANYMED corosync[1318]:   [SERV  ] Service engine unloaded: corosync configuration map access
Jun 24 13:35:41 GANYMED corosync[1318]:   [QB    ] withdrawing server sockets
Jun 24 13:35:41 GANYMED corosync[1318]:   [SERV  ] Service engine unloaded: corosync configuration service
Jun 24 13:35:41 GANYMED corosync[1318]:   [MAIN  ] Node was shut down by a signal
Jun 24 13:35:41 GANYMED corosync[1318]:   [QB    ] withdrawing server sockets

root@CALLISTO:~# journalctl -u corosync.service -b0
-- Logs begin at Sun 2021-06-27 20:46:01 CEST, end at Mon 2021-06-28 11:54:07 CEST. --
-- No entries --

Code:
root@GANYMED:~# pvecm status
Cluster information
-------------------
Name:             JUPITER
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Jun 28 11:58:01 2021
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000001
Ring ID:          1.1fe
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.2.10 (local)

root@CALLISTO:~# pvecm status
Cluster information
-------------------
Name:             JUPITER
Config Version:   3
Transport:        knet
Secure auth:      on

Cannot initialize CMAP service

Code:
root@GANYMED:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.119-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-3
pve-kernel-helper: 6.4-3
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

root@CALLISTO:~# pveversion -v
proxmox-ve: 6.4-1 (running kernel: 5.4.119-1-pve)
pve-manager: 6.4-8 (running version: 6.4-8/185e14db)
pve-kernel-5.4: 6.4-3
pve-kernel-helper: 6.4-3
pve-kernel-5.4.119-1-pve: 5.4.119-1
pve-kernel-5.4.114-1-pve: 5.4.114-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.2-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-3
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.10-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.5-6
pve-cluster: 6.4-1
pve-container: 3.3-5
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.2-4
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

The content of /var/log/syslog mainly repeats the same messages over and over:

Code:
GANYMED:

Jun 28 12:04:00 GANYMED systemd[1]: Starting Proxmox VE replication runner...
Jun 28 12:04:00 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:01 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:02 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:03 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:04 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:05 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:06 GANYMED pvestatd[1335]: storage 'pve-share' is not online
Jun 28 12:04:06 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:07 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:08 GANYMED pvesr[26577]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:04:09 GANYMED pvesr[26577]: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 28 12:04:09 GANYMED systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jun 28 12:04:09 GANYMED systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 28 12:04:09 GANYMED systemd[1]: Failed to start Proxmox VE replication runner.
Jun 28 12:04:15 GANYMED pvestatd[1335]: storage 'pve-share' is not online
Jun 28 12:04:25 GANYMED pvestatd[1335]: storage 'pve-share' is not online
Jun 28 12:04:35 GANYMED pvestatd[1335]: storage 'pve-share' is not online
Jun 28 12:04:46 GANYMED pvestatd[1335]: storage 'pve-share' is not online
Jun 28 12:04:55 GANYMED pvestatd[1335]: storage 'pve-share' is not online

CALLISTO:

Jun 28 12:07:00 CALLISTO systemd[1]: Starting Proxmox VE replication runner...
Jun 28 12:07:01 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:02 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:03 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:04 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:05 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:06 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:07 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:08 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:09 CALLISTO pvesr[4721]: trying to acquire cfs lock 'file-replication_cfg' ...
Jun 28 12:07:09 CALLISTO pvestatd[1089]: storage 'pve-share' is not online
Jun 28 12:07:10 CALLISTO pvesr[4721]: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 28 12:07:10 CALLISTO systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Jun 28 12:07:10 CALLISTO systemd[1]: pvesr.service: Failed with result 'exit-code'.
Jun 28 12:07:10 CALLISTO systemd[1]: Failed to start Proxmox VE replication runner.
Jun 28 12:07:19 CALLISTO pvestatd[1089]: storage 'pve-share' is not online
Jun 28 12:07:30 CALLISTO pvestatd[1089]: storage 'pve-share' is not online
Jun 28 12:07:39 CALLISTO pvestatd[1089]: storage 'pve-share' is not online
Jun 28 12:07:50 CALLISTO pvestatd[1089]: storage 'pve-share' is not online
Jun 28 12:07:59 CALLISTO pvestatd[1089]: storage 'pve-share' is not online

pve-share here is shared storage mounted from a NAS to store ISOs and container templates.
 
pvestatd[1335]: storage 'pve-share' is not online
This message is not related to corosync; it looks like the node can't reach your network storage.
(Do you have multiple network links?)
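
You can quickly see which storages the node can currently reach with the standard pvesm tool, e.g.:
Code:
pvesm status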

BTW, you shouldn't enable HA on a 2-node cluster. (It seems that you have the CRM/LRM services running.)
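
Something like this should show whether HA is actually in use and whether the CRM/LRM services are running (assuming the standard PVE 6.x service names):
Code:
ha-manager status
systemctl status pve-ha-crm pve-ha-lrm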
 
Hi spirit,
This message is not related to corosync; it looks like the node can't reach your network storage.
(Do you have multiple network links?)
No, there is just one network link, but the NAS is simply switched off at the moment.

BTW, you shouldn't enable HA on a 2-node cluster. (It seems that you have the CRM/LRM services running.)
Could this be a possible cause of my issue?

And in case it is, how can I switch this off? Without quorum I also can't edit the HA options. :-(

Even changing the corosync.conf to temporarily give one node 2 votes fails...
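
For context, that change looks roughly like this in the nodelist section (here for GANYMED, with config_version in the totem section bumped as well):
Code:
node {
    name: GANYMED
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 192.168.2.10
}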

EDIT (at 1am):

Even booting into a live image and changing the corosync config so that one node has two votes did not restore quorum.

No idea whether it would have been possible to remove the HA entries that way, but since I didn't know where they are stored, I decided to reinstall both nodes and restore the backups.

Thanks for your help anyway :)
Maphisto
 
root@CALLISTO:~# journalctl -u corosync.service -b0
-- Logs begin at Sun 2021-06-27 20:46:01 CEST, end at Mon 2021-06-28 11:54:07 CEST. --
-- No entries --

It doesn't seem like the service got started at all on this node. Please try starting it with systemctl start corosync.service and check again; systemctl status corosync.service might provide a bit of additional information.
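
I.e. something along these lines on CALLISTO:
Code:
systemctl start corosync.service
systemctl status corosync.service
journalctl -u corosync.service -b0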
 
It doesn't seem like the service got started at all on this node. Please try starting it with systemctl start corosync.service and check again; systemctl status corosync.service might provide a bit of additional information.

I remember I did this with systemctl restart corosync.service, but it instantly failed again.

Unfortunately I can't get you any debug information anymore, as I reinstalled last night and restored my backups after that. :-(
Currently there is just the Docker LXC left to reinstall, as I probably forgot to schedule automated backups for that one.

Thanks a lot for your help anyway :)
 
