Losing quorum

gijsbert

I have a running cluster (5.4-13) with 11 nodes and everything runs fine until I reboot one of the nodes. Attached are the corosync.conf and the pvecm status; everything looks OK. However, when I reboot one of the nodes, a random other node loses quorum once the rebooted node comes back online. There are two different types of error messages that can occur on the running node that loses quorum:
======
ERROR1
======

Oct 17 10:19:12 pmnode12 corosync[10721]: error [TOTEM ] FAILED TO RECEIVE
Oct 17 10:19:12 pmnode12 corosync[10721]: [TOTEM ] FAILED TO RECEIVE
Oct 17 10:19:21 pmnode12 corosync[10721]: notice [TOTEM ] A new membership (213.132.140.97:86456) was formed. Members left: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: notice [TOTEM ] Failed to receive the leave message. failed: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: warning [CPG ] downlist left_list: 10 received
Oct 17 10:19:21 pmnode12 corosync[10721]: [TOTEM ] A new membership (213.132.140.97:86456) was formed. Members left: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: [TOTEM ] Failed to receive the leave message. failed: 8 13 9 4 5 1 15 14 6 3
Oct 17 10:19:21 pmnode12 corosync[10721]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.

======
ERROR2
======
Oct 17 10:27:57 pmnode13 corosync[28530]: [TOTEM ] Retransmit List: 2 3 4 c 5 6 7 8 9 a b
Oct 17 10:27:57 pmnode13 corosync[28530]: error [TOTEM ] FAILED TO RECEIVE
Oct 17 10:27:57 pmnode13 corosync[28530]: [TOTEM ] FAILED TO RECEIVE
Oct 17 10:27:57 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 50
Oct 17 10:27:58 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 70
Oct 17 10:27:58 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 60
Oct 17 10:27:59 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 80
Oct 17 10:27:59 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 70
Oct 17 10:28:00 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 90
Oct 17 10:28:00 pmnode13 systemd[1]: Starting Proxmox VE replication runner...
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 80
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 100
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retried 100 times
Oct 17 10:28:01 pmnode13 pmxcfs[645]: [dcdb] crit: cpg_send_message failed: 6
Oct 17 10:28:01 pmnode13 CRON[10771]: (root) CMD (cd /tmp && iostat -xkd 30 2 | sed 's/,/\./g' > io.tmp && sleep 1 && mv io.tmp iostat.cache 2>/dev/null)
Oct 17 10:28:02 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 90
Oct 17 10:28:02 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 10
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 100
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retried 100 times
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [status] crit: cpg_send_message failed: 6
Oct 17 10:28:03 pmnode13 pve-firewall[2295]: firewall update time (7.641 seconds)
Oct 17 10:28:03 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 20
Oct 17 10:28:04 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 10
Oct 17 10:28:04 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 30
Oct 17 10:28:05 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 20
Oct 17 10:28:05 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retry 40
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retry 30
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [TOTEM ] A new membership (213.132.140.98:86600) was formed. Members left: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [TOTEM ] Failed to receive the leave message. failed: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: [TOTEM ] A new membership (213.132.140.98:86600) was formed. Members left: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: [TOTEM ] Failed to receive the leave message. failed: 8 2 9 4 5 1 15 14 6 3
Oct 17 10:28:06 pmnode13 corosync[28530]: warning [CPG ] downlist left_list: 9 received
Oct 17 10:28:06 pmnode13 corosync[28530]: [CPG ] downlist left_list: 9 received
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: members: 13/645
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [QUORUM] Members[1]: 13
Oct 17 10:28:06 pmnode13 corosync[28530]: notice [MAIN ] Completed service synchronization, ready to provide service.
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: members: 13/645
Oct 17 10:28:06 pmnode13 corosync[28530]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 17 10:28:06 pmnode13 corosync[28530]: [QUORUM] Members[1]: 13
Oct 17 10:28:06 pmnode13 corosync[28530]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: node lost quorum
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: cpg_send_message retried 46 times
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] crit: received write while not quorate - trigger resync
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] crit: leaving CPG group
Oct 17 10:28:06 pmnode13 pve-ha-lrm[31006]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmnode13/lrm_status.tmp.31006' - Permission denied
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: cpg_send_message retried 31 times
Oct 17 10:28:06 pmnode13 pvesr[10763]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: start cluster connection
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: members: 13/645
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [dcdb] notice: all data is up to date
Oct 17 10:28:06 pmnode13 pvestatd[757]: status update time (13.220 seconds)
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/222: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/108: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/123: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/157: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/181: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/198: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/189: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/111: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/117: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/101: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/217: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/VM-backups-backup12: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/VM-backups-backup11: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/local: -1
Oct 17 10:28:06 pmnode13 pmxcfs[645]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmnode13/VM-backups-backup17: -1
Oct 17 10:28:07 pmnode13 pvesr[10763]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 17 10:28:15 pmnode13 pvesr[10763]: error with cfs lock 'file-replication_cfg': no quorum!
Oct 17 10:28:15 pmnode13 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Oct 17 10:28:15 pmnode13 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 17 10:28:15 pmnode13 systemd[1]: pvesr.service: Unit entered failed state.
Oct 17 10:28:15 pmnode13 systemd[1]: pvesr.service: Failed with result 'exit-code'.

Once one of the above errors occurs, the node gets a completely different ring ID and has only 1 vote. Most of the time the cluster recovers automatically after a few minutes, but sometimes I have to restart corosync to get things back in sync.
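
For reference, the ring ID and vote count on the affected node can be checked with the standard tools, and the restart is just a service restart (rough sketch, the output obviously differs per node):

pvecm status                  # ring ID, expected/total votes, quorate yes/no
corosync-quorumtool -s        # the same information straight from corosync
systemctl restart corosync    # the restart mentioned above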

I have no idea what causes or triggers the above errors, but I would like to get this issue solved. I'm a little afraid that one day I will be unable to start VMs on a node that has entered some kind of failed state.

Kind regards and thanks in advance for your reply,

Gijsbert
 


All nodes are installed in a single rack and the physical switch acts as an IGMP querier, with IGMP snooping enabled.
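
On the node side, the Linux bridge's own snooping/querier flags can be double-checked as well (sketch; vmbr0 is the bridge name from the network config further down in this thread):

cat /sys/class/net/vmbr0/bridge/multicast_snooping   # 1 = the bridge snoops IGMP itself
cat /sys/class/net/vmbr0/bridge/multicast_querier    # 1 = the bridge sends IGMP queries itself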
 
I am doing some further troubleshooting. With omping (run on all nodes) I see the following:

virt011 : waiting for response msg
virt012 : waiting for response msg
virt013 : waiting for response msg
virt014 : waiting for response msg
virt016 : waiting for response msg
virt017 : waiting for response msg
virt027 : waiting for response msg
virt028 : waiting for response msg
virt029 : waiting for response msg
virt030 : waiting for response msg
virt016 : joined (S,G) = (*, 232.43.211.234), pinging
virt017 : joined (S,G) = (*, 232.43.211.234), pinging
virt029 : joined (S,G) = (*, 232.43.211.234), pinging
virt011 : joined (S,G) = (*, 232.43.211.234), pinging
virt027 : joined (S,G) = (*, 232.43.211.234), pinging
virt028 : joined (S,G) = (*, 232.43.211.234), pinging
virt012 : waiting for response msg
virt013 : waiting for response msg
virt014 : waiting for response msg
virt030 : waiting for response msg
virt013 : joined (S,G) = (*, 232.43.211.234), pinging
virt014 : joined (S,G) = (*, 232.43.211.234), pinging
virt030 : joined (S,G) = (*, 232.43.211.234), pinging
virt012 : waiting for response msg
virt012 : waiting for response msg
virt012 : joined (S,G) = (*, 232.43.211.234), pinging
virt012 : waiting for response msg
virt012 : server told us to stop
virt016 : given amount of query messages was sent
virt017 : given amount of query messages was sent
virt029 : given amount of query messages was sent
virt011 : given amount of query messages was sent
virt027 : given amount of query messages was sent
virt028 : given amount of query messages was sent
virt013 : given amount of query messages was sent
virt014 : given amount of query messages was sent
virt030 : given amount of query messages was sent

virt011 : unicast, xmt/rcv/%loss = 10000/7547/24%, min/avg/max/std-dev = 0.038/9.909/123.537/6.004
virt011 : multicast, xmt/rcv/%loss = 10000/7528/24%, min/avg/max/std-dev = 0.084/9.683/17.545/3.077
virt012 : unicast, xmt/rcv/%loss = 5896/2748/53%, min/avg/max/std-dev = 61.278/70.859/91.684/4.967
virt012 : multicast, xmt/rcv/%loss = 5896/2737/53%, min/avg/max/std-dev = 61.336/70.925/91.733/4.975
virt013 : unicast, xmt/rcv/%loss = 10000/6088/39%, min/avg/max/std-dev = 0.041/17.577/759.345/48.641
virt013 : multicast, xmt/rcv/%loss = 10000/6080/39%, min/avg/max/std-dev = 0.092/17.614/759.396/48.673
virt014 : unicast, xmt/rcv/%loss = 10000/9211/7%, min/avg/max/std-dev = 0.032/6.715/41.390/4.022
virt014 : multicast, xmt/rcv/%loss = 10000/9205/7%, min/avg/max/std-dev = 0.067/6.739/41.437/4.020
virt016 : unicast, xmt/rcv/%loss = 10000/9143/8%, min/avg/max/std-dev = 0.033/2.295/878.470/42.630
virt016 : multicast, xmt/rcv/%loss = 10000/9137/8%, min/avg/max/std-dev = 0.040/2.302/878.472/42.639
virt017 : unicast, xmt/rcv/%loss = 10000/9999/0%, min/avg/max/std-dev = 0.034/0.084/0.237/0.019
virt017 : multicast, xmt/rcv/%loss = 10000/9991/0%, min/avg/max/std-dev = 0.039/0.088/0.281/0.020
virt027 : unicast, xmt/rcv/%loss = 10000/6469/35%, min/avg/max/std-dev = 13.903/42.362/1001.938/45.443
virt027 : multicast, xmt/rcv/%loss = 10000/6451/35% (seq>=2 35%), min/avg/max/std-dev = 13.906/42.406/1001.986/45.504
virt028 : unicast, xmt/rcv/%loss = 10000/2071/79%, min/avg/max/std-dev = 100.519/124.422/1104.786/104.342
virt028 : multicast, xmt/rcv/%loss = 10000/2065/79%, min/avg/max/std-dev = 100.571/124.554/1104.839/104.492
virt029 : unicast, xmt/rcv/%loss = 10000/9198/8%, min/avg/max/std-dev = 0.038/5.531/41.890/8.206
virt029 : multicast, xmt/rcv/%loss = 10000/9192/8%, min/avg/max/std-dev = 0.044/5.555/41.937/8.212
virt030 : unicast, xmt/rcv/%loss = 10000/3785/62%, min/avg/max/std-dev = 63.067/73.659/1235.848/69.285
virt030 : multicast, xmt/rcv/%loss = 10000/3785/62%, min/avg/max/std-dev = 63.110/73.707/1235.894/69.284

In my firewall, however, the multicast address is set to 239.192.150.125, so I have no idea where this 232.43.211.234 is coming from. Anyone?

Gijsbert
 
>>In my firewall, however, the multicast address is set to 239.192.150.125, so I have no idea where this 232.43.211.234 is coming from. Anyone?

The multicast address for corosync is generated from the corosync cluster name.

You can find it with:

# corosync-cmapctl |grep mcastaddr
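
The cluster name it is derived from can be checked the same way (shown here just for reference, same tool):

# corosync-cmapctl |grep cluster_name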
 

Thanks spirit :-)

totem.interface.0.mcastaddr (str) = 239.192.150.125

If I do an "omping -m 239.192.150.125" on all nodes, it all looks OK; on each node I see something like:

pmnode12 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.640/1.462/3.053/0.759
pmnode12 : multicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.718/1.526/3.102/0.760

However, when I do an "omping -c 10000 -i 0.001 -F -q" against all nodes, I see a lot of packet loss:

pmnode12 : unicast, xmt/rcv/%loss = 10000/2619/73%, min/avg/max/std-dev = 8.272/83.967/96.724/5.775
pmnode12 : multicast, xmt/rcv/%loss = 10000/2328/76%, min/avg/max/std-dev = 8.360/83.910/96.510/5.939

And I see messages in syslog like:

- pmnode12 : waiting for response msg
- pmnode12 : joined (S,G) = (*, 232.43.211.234), pinging
- pmnode13 : server told us to stop
- pmnode13 : given amount of query messages was sent

1) Where is 232.43.211.234 coming from?

2) Most important question/thought for now: the routers and switches are not maintained by me, so I don't have access to them and I rely on my ISP. Isn't there a way to manage the multicast behaviour from within the cluster / Proxmox software, so that the cluster operates independently of the ISP's routers and switches? If that is not possible or too complex, what should I ask my ISP?

Gijsbert
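
For what it's worth on question 1: 232.43.211.234 appears to be omping's own built-in default multicast group, used whenever -m is not passed, and the "-c 10000 -i 0.001 -F -q" run above was started without -m; only the test with "-m 239.192.150.125" actually exercised the corosync group. On question 2, corosync 2.x (as shipped with PVE 5.x) can be switched from multicast to unicast UDP, which removes the dependency on IGMP snooping/queriers in the ISP's switches. A minimal, untested sketch of the relevant part of /etc/pve/corosync.conf follows; the cluster name and version number are placeholders, not values from this thread:

totem {
  # placeholder - keep the existing cluster_name
  cluster_name: mycluster
  # placeholder - must simply be higher than the current value
  config_version: 16
  # unicast UDP instead of multicast
  transport: udpu
  # leave the existing interface/bindnetaddr block as it is
}

After saving (pmxcfs distributes the file), corosync has to be restarted on every node, and the unicast ports (UDP 5404/5405 by default, if I remember correctly) of course have to be allowed in the firewall.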
 
It's really not recommended to run corosync over a high-latency link or a public network (and ISPs generally block multicast).
With a latency of 1-2 ms (not even counting the packet loss :/), I think you can do a 4-5 node cluster max.

With 50-80 ms latency, forget about it, it'll never work.
 

So what is your advice? Separate the multicast traffic onto another (internal) network over eth1 instead of the public eth0? My network config, currently all on eth0:

auto lo
iface lo inet loopback

iface eth0 inet manual
iface eth1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 213.132.140.96
        netmask 255.255.255.0
        gateway 213.132.140.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

iface vmbr0 inet6 static
        address 2a03:8b80:a:aa::11
        netmask 48
        gateway 2a03:8b80:a::1

auto vmbr0.102
iface vmbr0.102 inet static
        address 172.17.2.11
        netmask 255.255.0.0
        broadcast 172.17.255.255
        network 172.17.0.0
        vlan_raw_device vbmr0

iface vmbr0.102 inet6 static
        address fd00:517e:b17e::211
        netmask 64
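
If the answer is indeed a dedicated cluster network on eth1, a minimal sketch of what that could look like in /etc/network/interfaces; the 10.10.10.0/24 subnet and the .11 host part are assumptions, not values from this thread:

auto eth1
iface eth1 inet static
        # assumed private subnet, one address per node, switch ports in an isolated VLAN
        address 10.10.10.11
        netmask 255.255.255.0

Corosync would then have to be pointed at that network as well, either by re-creating the cluster with these addresses or by carefully editing the ring0_addr entries (and bumping config_version) in /etc/pve/corosync.conf.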
 
