Hi,
From what I can see in the logs, something very strange is going on with the cluster under heavy load (sustained for a long time; it can take about 3-4 days after the load has ended before the problem starts to show up in the cluster).
I cannot find any drops on the interfaces that would indicate this is the problem.
vmbr0 Link encap:Ethernet HWaddr ec:f4:bb:e7:f2:5e
inet addr:10.10.13.102 Bcast:10.10.13.255 Mask:255.255.255.0
inet6 addr: fe80::eef4:bbff:fee7:f25e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:99047944 errors:0 dropped:0 overruns:0 frame:0
TX packets:86666544 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
eth1 Link encap:Ethernet HWaddr ec:f4:bb:e7:f2:5e
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:107024643 errors:0 dropped:20581 overruns:0 frame:0
TX packets:93764407 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:32182262155 (29.9 GiB) TX bytes:16450322190 (15.3 GiB)
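For completeness, this is the kind of check I use to watch the drop counters over time (nothing special, just standard iproute2/sysfs on each node):
Code:
# one-off snapshot of RX/TX statistics, including the dropped counters
ip -s link show eth1
# or watch just the RX drop counter while the cluster is under load
watch -n 1 cat /sys/class/net/eth1/statistics/rx_dropped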
On all nodes I can find this:
Code:
Mar 05 11:40:57 tc01-c-h corosync[2065]: [TOTEM ] A processor failed, forming new configuration.
Mar 05 11:40:58 tc01-c-h corosync[2065]: [TOTEM ] A new membership (10.10.13.102:178312) was formed. Members
Mar 05 11:40:58 tc01-c-h corosync[2065]: [QUORUM] Members[13]: 2 3 5 7 6 8 9 10 11 12 13 14 4
Mar 05 11:40:58 tc01-c-h corosync[2065]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 06 03:06:20 tc01-c-h corosync[2065]: [TOTEM ] A processor failed, forming new configuration.
Mar 06 03:06:20 tc01-c-h corosync[2065]: [TOTEM ] A new membership (10.10.13.102:178316) was formed. Members
Mar 06 03:06:20 tc01-c-h corosync[2065]: [QUORUM] Members[13]: 2 3 5 7 6 8 9 10 11 12 13 14 4
Mar 06 03:06:20 tc01-c-h corosync[2065]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 06 19:46:10 tc01-c-h corosync[2065]: [TOTEM ] A processor failed, forming new configuration.
Mar 06 19:46:10 tc01-c-h corosync[2065]: [TOTEM ] A new membership (10.10.13.102:178320) was formed. Members
Mar 06 19:46:10 tc01-c-h corosync[2065]: [QUORUM] Members[13]: 2 3 5 7 6 8 9 10 11 12 13 14 4
Mar 06 19:46:10 tc01-c-h corosync[2065]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 07 19:56:03 tc01-c-h corosync[2065]: [TOTEM ] A processor failed, forming new configuration.
Mar 07 19:56:03 tc01-c-h corosync[2065]: [TOTEM ] A new membership (10.10.13.102:178324) was formed. Members
Mar 07 19:56:03 tc01-c-h corosync[2065]: [QUORUM] Members[13]: 2 3 5 7 6 8 9 10 11 12 13 14 4
Mar 07 19:56:03 tc01-c-h corosync[2065]: [MAIN ] Completed service synchronization, ready to provide service.
The cluster has 13 members in total, and it also has a dedicated 10G interface for the internal communication between all nodes. (The same port is also used for NFS storage that is attached directly to the LXC containers.)
pvecm status
Quorum information
------------------
Date: Fri Mar 8 15:29:25 2019
Quorum provider: corosync_votequorum
Nodes: 13
Node ID: 0x00000002
Ring ID: 2/178324
Quorate: Yes
Votequorum information
----------------------
Expected votes: 13
Highest expected: 13
Total votes: 13
Quorum: 7
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.10.13.102 (local)
0x00000003 1 10.10.13.103
0x00000005 1 10.10.13.104
0x00000007 1 10.10.13.105
0x00000006 1 10.10.13.106
0x00000008 1 10.10.13.107
0x00000009 1 10.10.13.108
0x0000000a 1 10.10.13.109
0x0000000b 1 10.10.13.110
0x0000000c 1 10.10.13.111
0x0000000d 1 10.10.13.112
0x0000000e 1 10.10.13.113
0x00000004 1 10.10.13.117
I have also checked that multicast is working OK on all nodes.
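For reference, this is roughly how I tested it (a sketch using omping, run on every node at the same time; only the first few node IPs are listed here, the rest need to be added):
Code:
# short burst test, should show ~0% loss on all nodes
omping -c 10000 -i 0.001 -F -q 10.10.13.102 10.10.13.103 10.10.13.104
# longer test (about 10 minutes) to catch IGMP snooping / querier timeouts
omping -c 600 -i 1 -q 10.10.13.102 10.10.13.103 10.10.13.104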
I have read that some people suggest increasing the token timeout in the corosync config to prevent this from happening, but is that really the solution?
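For clarity, the change I have seen suggested looks roughly like this in the totem section of /etc/pve/corosync.conf (only a sketch; the cluster_name, config_version and the 10000 ms token value are just examples, I have not applied this yet):
Code:
totem {
  # placeholder name and example config_version; config_version has to be
  # increased whenever the file is edited
  cluster_name: tc01-c
  config_version: 15
  version: 2
  # example timeout in milliseconds; the default is much lower
  token: 10000
  interface {
    bindnetaddr: 10.10.13.0
    ringnumber: 0
  }
}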