Hello,
I have a 3-node cluster with an iSCSI SAN.
After updating to 3.4 I have problems with the cluster.
After starting the nodes, everything looks OK.
I start the VMs and the cluster still works, but when I try to restart some VMs the cluster breaks down.
In the logs of one node I have:
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c17 2c18 2c2c 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c2d 2c3b 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c37 2c38 2c39 2c3a
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c38 2c39 2c3a 2c2e 2c2f 2c30 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c31 2c32 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c37
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c17 2c18 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c2c 2c2d 2c33 2c34 2c3b 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c37 2c38
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] Retransmit List: 2c37 2c38 2c2f 2c30 2c22 2c23 2c24 2c25 2c26 2c27 2c28 2c29 2c2a 2c2b 2c31 2c19 2c1a 2c1b 2c1c 2c1d 2c1e 2c1f 2c20 2c21 2c35 2c36 2c39 2c3a 2c3b 2c3c
Apr 7 10:42:39 node13 corosync[6379]: [TOTEM ] FAILED TO RECEIVE
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] CLM CONFIGURATION CHANGE
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] New Configuration:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.13)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Left:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.11)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.12)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Joined:
Apr 7 10:42:51 node13 pmxcfs[4576]: [status] notice: node lost quorum
Apr 7 10:42:51 node13 corosync[6379]: [QUORUM] Members[2]: 2 3
Apr 7 10:42:51 node13 corosync[6379]: [CMAN ] quorum lost, blocking activity
Apr 7 10:42:51 node13 corosync[6379]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 7 10:42:51 node13 corosync[6379]: [QUORUM] Members[1]: 3
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] CLM CONFIGURATION CHANGE
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] New Configuration:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] #011r(0) ip(x.y.z.13)
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Left:
Apr 7 10:42:51 node13 corosync[6379]: [CLM ] Members Joined:
Apr 7 10:42:51 node13 corosync[6379]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 7 10:42:51 node13 rgmanager[8505]: #1: Quorum Dissolved
Apr 7 10:42:51 node13 dlm_controld[8229]: node_history_cluster_remove no nodeid 1
Apr 7 10:42:51 node13 corosync[6379]: [CPG ] chosen downlist: sender r(0) ip(x.y.z.13) ; members(old:3 left:2)
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 2/4674, 3/4576
Apr 7 10:42:51 node13 kernel: dlm: closing connection to node 1
Apr 7 10:42:51 node13 kernel: dlm: closing connection to node 2
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: starting data syncronisation
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 3/4576
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: all data is up to date
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 2/4674, 3/4576
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: starting data syncronisation
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: members: 3/4576
Apr 7 10:42:51 node13 pmxcfs[4576]: [dcdb] notice: all data is up to date
Apr 7 10:42:51 node13 corosync[6379]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 7 10:42:51 node13 pmxcfs[4576]: [status] notice: cpg_send_message retried 1 Times
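The "Retransmit List" flood followed by "FAILED TO RECEIVE" usually points at lossy multicast on the corosync ring rather than at the VMs themselves. As a rough first check, the warnings can be counted in the syslog; this is only a sketch, and the log path is an assumption (adjust it for your distribution):

```shell
# Count corosync TOTEM retransmit warnings; a rapidly growing number
# indicates ongoing packet loss on the cluster network.
# /var/log/syslog is an assumed path -- on some systems it is /var/log/daemon.log.
grep -c 'Retransmit List' /var/log/syslog
```

If the count climbs steadily while VMs restart, a multicast connectivity test between the three nodes (for example with omping, run on all nodes at once) would be a reasonable next step.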
On all nodes I have:
Version: 6.2.0
Config Version: 17
Cluster Name: OLIMP
Cluster Id: 2398
Cluster Member: Yes
Cluster Generation: 22336
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 6
Flags:
Ports Bound: 0 177
Node name: zeus13
Node ID: 3
Multicast addresses: 239.192.9.103
Node addresses: x.y.z.13
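The "Quorum: 2 Activity blocked" line follows directly from the numbers above: with three expected votes, a simple majority of two is required, but only this node's single vote is present, so activity is blocked. A minimal sketch of that arithmetic (the majority formula is the standard one; the variable names are mine, not from the tool's output):

```python
# Quorum as a simple majority of expected votes, matching the status output:
# Expected votes: 3, Total votes: 1, Quorum: 2 "Activity blocked"
expected_votes = 3
total_votes = 1                         # only this node is currently voting
quorum = expected_votes // 2 + 1        # majority threshold -> 2
activity_blocked = total_votes < quorum # -> True, hence "Activity blocked"
print(quorum, activity_blocked)
```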
This is a production system: right now the VMs keep running even without quorum and without a working cluster, and I need those VMs.
Do you have any idea what to do?
How can I repair the cluster while the VMs keep running, without restarting the nodes?
Thanks in advance.