Cluster quorum issues after upgrading from 3.4 to 4.0

dirk.nilius

Member
Nov 5, 2015
Berlin, Germany
On some cluster nodes there is a 50% chance of getting a "TASK ERROR: cluster not ready - no quorum?" error. This never happened before the upgrade. I think it happens more often when the servers are under load; maybe the timeout is too short? After a little while the sync works as expected, but the machines stay off.
 
Do you mean this happens at node start?
Can you show the syslog output from when that happens?

The log entries from pmxcfs and corosync would be especially interesting.

At the moment the 'startall' command waits 10 seconds from its start for quorum (i.e. a writable /etc/pve).
Normally that should be enough.
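In essence, the quorum check boils down to waiting until /etc/pve becomes writable. A rough standalone sketch of such a wait loop (the function name and the probe-file approach are my own illustration, not actual PVE code):

```shell
# Illustrative sketch only: poll until DIR is writable, for up to
# TIMEOUT seconds. On a PVE node DIR would be /etc/pve, which pmxcfs
# only makes writable once the node has quorum.
wait_for_quorum() {
    dir=$1
    timeout=$2
    while [ "$timeout" -gt 0 ]; do
        # A successful write implies the cluster filesystem is quorate.
        if touch "$dir/.quorum_probe" 2>/dev/null; then
            rm -f "$dir/.quorum_probe"
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1
}
```

With a fixed 10-second window like this, heavy load on the other nodes can easily exhaust the limit before corosync finishes forming the membership.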
 
Hi Thomas,

I can say that under some circumstances 10 seconds are far from enough. When the other cluster nodes are under heavy load it takes much longer.

Yes, this happens at node startup.

Here is the log:

Nov 5 01:30:03 ckc-b-p0003 lxc-devsetup[1879]: Creating /dev/.lxc
Nov 5 01:30:03 ckc-b-p0003 lxc-devsetup[1879]: /dev is devtmpfs
Nov 5 01:30:03 ckc-b-p0003 lxc-devsetup[1879]: Creating /dev/.lxc/user
Nov 5 01:30:03 ckc-b-p0003 iscsid: iSCSI daemon with pid=1627 started!
Nov 5 01:30:03 ckc-b-p0003 postfix[1756]: Starting Postfix Mail Transport Agent: postfix.
Nov 5 01:30:03 ckc-b-p0003 cron[1753]: (CRON) INFO (Running @reboot jobs)
Nov 5 01:30:03 ckc-b-p0003 postfix/master[1928]: daemon started -- version 2.11.3, configuration /etc/postfix
Nov 5 01:30:03 ckc-b-p0003 corosync[1945]: [MAIN ] Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
Nov 5 01:30:03 ckc-b-p0003 corosync[1945]: [MAIN ] Corosync built-in features: augeas systemd pie relro bindnow
Nov 5 01:30:04 ckc-b-p0003 pve-firewall[1946]: starting server
Nov 5 01:30:04 ckc-b-p0003 pvestatd[1947]: starting server
Nov 5 01:30:04 ckc-b-p0003 kernel: [ 34.431134] ip_set: protocol 6
Nov 5 01:30:04 ckc-b-p0003 kernel: [ 34.529150] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
Nov 5 01:30:04 ckc-b-p0003 kernel: [ 34.641222] ip6_tables: (C) 2000-2006 Netfilter Core Team
Nov 5 01:30:04 ckc-b-p0003 kernel: [ 34.934661] bnx2 0000:02:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
Nov 5 01:30:04 ckc-b-p0003 kernel: [ 34.934674]
Nov 5 01:30:04 ckc-b-p0003 kernel: [ 34.934787] vmbr0: port 1(eth0) entered forwarding state
Nov 5 01:30:04 ckc-b-p0003 kernel: [ 34.934804] vmbr0: port 1(eth0) entered forwarding state
Nov 5 01:30:05 ckc-b-p0003 pvedaemon[1989]: starting server
Nov 5 01:30:05 ckc-b-p0003 pvedaemon[1989]: starting 3 worker(s)
Nov 5 01:30:05 ckc-b-p0003 pvedaemon[1989]: worker 1990 started
Nov 5 01:30:05 ckc-b-p0003 pvedaemon[1989]: worker 1991 started
Nov 5 01:30:05 ckc-b-p0003 pvedaemon[1989]: worker 1992 started
Nov 5 01:30:05 ckc-b-p0003 kernel: [ 35.193235] bnx2 0000:02:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
Nov 5 01:30:05 ckc-b-p0003 kernel: [ 35.193256] , receive & transmit flow control ON
Nov 5 01:30:05 ckc-b-p0003 kernel: [ 35.193381] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
Nov 5 01:30:06 ckc-b-p0003 pveproxy[2019]: starting server
Nov 5 01:30:06 ckc-b-p0003 pveproxy[2019]: starting 3 worker(s)
Nov 5 01:30:06 ckc-b-p0003 pveproxy[2019]: worker 2020 started
Nov 5 01:30:06 ckc-b-p0003 pveproxy[2019]: worker 2021 started
Nov 5 01:30:06 ckc-b-p0003 pveproxy[2019]: worker 2022 started
Nov 5 01:30:06 ckc-b-p0003 pve-manager[2024]: Starting VMs and Containers
Nov 5 01:30:06 ckc-b-p0003 spiceproxy[2030]: starting server
Nov 5 01:30:06 ckc-b-p0003 spiceproxy[2030]: starting 1 worker(s)
Nov 5 01:30:06 ckc-b-p0003 spiceproxy[2030]: worker 2031 started
Nov 5 01:30:06 ckc-b-p0003 pvesh: <root@pam> starting task UPID:ckc-b-p0003:000007F0:00000E59:563AA30E:startall::root@pam:
Nov 5 01:30:06 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:06 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:06 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:06 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:08 ckc-b-p0003 pmxcfs[1631]: [quorum] crit: quorum_initialize failed: 2
Nov 5 01:30:08 ckc-b-p0003 pmxcfs[1631]: [confdb] crit: cmap_initialize failed: 2
Nov 5 01:30:08 ckc-b-p0003 pmxcfs[1631]: [dcdb] crit: cpg_initialize failed: 2
Nov 5 01:30:08 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_initialize failed: 2
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [quorum] crit: quorum_initialize failed: 2
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [confdb] crit: cmap_initialize failed: 2
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [dcdb] crit: cpg_initialize failed: 2
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_initialize failed: 2
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:14 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:16 ckc-b-p0003 task UPID:ckc-b-p0003:000007F0:00000E59:563AA30E:startall::root@pam:: cluster not ready - no quorum?
Nov 5 01:30:16 ckc-b-p0003 pve-manager[2024]: cluster not ready - no quorum?
Nov 5 01:30:16 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:16 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:16 ckc-b-p0003 pvesh: <root@pam> end task UPID:ckc-b-p0003:000007F0:00000E59:563AA30E:startall::root@pam: cluster not ready - no quorum?
Nov 5 01:30:16 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:16 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_send_message failed: 9
Nov 5 01:30:20 ckc-b-p0003 pmxcfs[1631]: [quorum] crit: quorum_initialize failed: 2
Nov 5 01:30:20 ckc-b-p0003 pmxcfs[1631]: [confdb] crit: cmap_initialize failed: 2
Nov 5 01:30:20 ckc-b-p0003 pmxcfs[1631]: [dcdb] crit: cpg_initialize failed: 2
Nov 5 01:30:20 ckc-b-p0003 pmxcfs[1631]: [status] crit: cpg_initialize failed: 2
Nov 5 01:30:23 ckc-b-p0003 corosync[2080]: [TOTEM ] Initializing transport (UDP/IP Multicast).
Nov 5 01:30:23 ckc-b-p0003 corosync[2080]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [TOTEM ] The network interface [192.168.13.12] is now up.
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [SERV ] Service engine loaded: corosync configuration map access [0]
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QB ] server name: cmap
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [SERV ] Service engine loaded: corosync configuration service [1]
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QB ] server name: cfg
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QB ] server name: cpg
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [SERV ] Service engine loaded: corosync profile loading service [4]
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QUORUM] Using quorum provider corosync_votequorum
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QB ] server name: votequorum
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QB ] server name: quorum
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [TOTEM ] A new membership (192.168.13.12:5044) was formed. Members joined: 2
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QUORUM] Members[1]: 2
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [TOTEM ] A new membership (192.168.13.12:5068) was formed. Members joined: 3 1
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QUORUM] This node is within the primary component and will provide service.
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [QUORUM] Members[3]: 2 3 1
Nov 5 01:30:24 ckc-b-p0003 corosync[2080]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 5 01:30:24 ckc-b-p0003 corosync[1939]: Starting Corosync Cluster Engine (corosync): [ OK ]
 
OK, you're correct. I sent a fix which increases the wait time to 60 seconds; that should be enough for all cases.

If you need even more time, tune the services causing the load so that they start after the PVE services.
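For example, ordering a load-heavy service after the PVE startup can be done with a systemd drop-in. The unit names below are placeholders I chose for illustration; check the actual names on your system with `systemctl list-units 'pve*'`:

```ini
# /etc/systemd/system/backup-heavy.service.d/order.conf
# "backup-heavy.service" stands in for whatever service causes the load;
# the exact PVE unit name may differ between versions.
[Unit]
After=pve-manager.service
```

After creating the drop-in, run `systemctl daemon-reload` so systemd picks up the new ordering.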
 
