Hello,
I have a 3-node cluster with an iSCSI SAN.
After updating to 3.4 I have problems with the cluster.
After starting the nodes, everything looks OK.
I start the VMs and the cluster still works, but when I try to restart some VMs the cluster breaks down.
In the logs of one node I have:
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c17 2c18 2c2c 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c2d 2c3b 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c37 2c38 2c39 2c3a
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c38 2c39 2c3a 2c2e 2c2f 2c30 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c31 2c32 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c37
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c17 2c18 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c2c 2c2d 2c33 2c34 2c3b 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c37 2c38
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c37 2c38 2c2f 2c30 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c31 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c39 2c3a 2c3b 2c3c
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] FAILED TO RECEIVE
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] CLM CONFIGURATION CHANGE
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] New Configuration:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.13)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Left:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.11)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.12)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Joined:
Apr 7 10:42:51 node13 pmxcfs[4576]: [status] notice: node lost quorum
Apr 7 10:42:51 node13 corosync[6379]: [QUORUM] Members[2]: 2 3
Apr 7 10:42:51 node13 corosync[6379]: [CMAN ] quorum lost, blocking activity
Apr 7 10:42:51 node13 corosync[6379]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 7 10:42:51 node13 corosync[6379]: [QUORUM] Members[1]: 3
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] CLM CONFIGURATION CHANGE
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] New Configuration:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.13)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Left:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Joined:
Apr 7 10:42:51 node13 corosync[6379]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 7 10:42:51 node13 rgmanager[8505]: #1: Quorum Dissolved
Apr 7 10:42:51 node13 dlm_controld[8229]: node_history_cluster_remove no nodeid 1
Apr 7 10:42:51 node13 corosync[6379]: [CPG ] chosen downlist: sender r(0) ip(x.y.z.13) ; members(old:3 left:2)
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 2/4674, 3/4576
Apr 7 10:42:51 node13 kernel: dlm: closing connection to node 1
Apr 7 10:42:51 node13 kernel: dlm: closing connection to node 2
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: starting data syncronisation
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 3/4576
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: all data is up to date
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 2/4674, 3/4576
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: starting data syncronisation
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 3/4576
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: all data is up to date
Apr 7 10:42:51 node13 corosync[6379]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 7 10:42:51 node13 pmxcfs[4576]: [status] notice: cpg_send_message retried 1 Times
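The "Retransmit List" flood followed by "FAILED TO RECEIVE" usually points at lossy multicast on the corosync ring rather than at the VMs themselves. As a rough first check, the warnings can be counted in the syslog; this is only a sketch, and the log path is an assumption (adjust it for your distribution):

```shell
# Count corosync TOTEM retransmit warnings; a rapidly growing number
# indicates ongoing packet loss on the cluster network.
# /var/log/syslog is an assumed path -- on some systems it is /var/log/daemon.log.
grep -c 'Retransmit List' /var/log/syslog
```

If the count climbs steadily while VMs restart, a multicast connectivity test between the three nodes (for example with omping, run on all nodes at once) would be a reasonable next step.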
On all nodes I have:
Version: 6.2.0
Config Version: 17
Cluster Name: OLIMP
Cluster Id: 2398
Cluster Member: Yes
Cluster Generation: 22336
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 6
Flags:
Ports Bound: 0 177
Node name: zeus13
Node ID: 3
Multicast addresses: 239.192.9.103
Node addresses: x.y.z.13
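The "Quorum: 2 Activity blocked" line follows directly from the numbers above: with three expected votes, a simple majority of two is required, but only this node's single vote is present, so activity is blocked. A minimal sketch of that arithmetic (the majority formula is the standard one; the variable names are mine, not from the tool's output):

```python
# Quorum as a simple majority of expected votes, matching the status output:
# Expected votes: 3, Total votes: 1, Quorum: 2 "Activity blocked"
expected_votes = 3
total_votes = 1                         # only this node is currently voting
quorum = expected_votes // 2 + 1        # majority threshold -> 2
activity_blocked = total_votes < quorum # -> True, hence "Activity blocked"
print(quorum, activity_blocked)
```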
This is a production system: right now the VMs keep running even without quorum and without a working cluster, and I need those VMs.
Do you have any idea what to do?
How can I repair the cluster while the VMs keep running, without restarting the nodes?
Thanks in advance.