Cluster Instability

Last week something happened where nodes just started rebooting each other because, I assume, they thought other nodes were down, so we removed all VMs from the cluster.

This morning, a similar thing happened.

One node went down, then another and another. Initially, we thought this was due to conntrack reporting its table was full, but that occurred on two nodes that didn't reboot (and the limit has been increased accordingly). Although that may be related?
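
For reference, a rough sketch of how the table usage could be checked against the limit (it assumes the nf_conntrack module is loaded and the usual /proc paths exist):

# Rough sketch: compare the current conntrack entry count against the limit.
# Assumes the nf_conntrack module is loaded and the standard /proc paths exist.

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")

print(f"conntrack: {count}/{limit} ({100 * count / limit:.1f}% used)")
# Raising net.netfilter.nf_conntrack_max via sysctl is how the limit was
# increased; the right value depends on RAM and workload.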

Here are some logs from the first node that rebooted.

Any suggestions on further debugging would be appreciated, as twice in a week is confusing. (The switch is set up correctly in regard to multicast/IGMP snooping.)

Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] Failed to receive the leave message. failed: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: [TOTEM ] Failed to receive the leave message. failed: 1 5
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: starting data syncronisation
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: starting data syncronisation
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [QUORUM] Members[7]: 4 6 7 8 9 2 3
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 19 07:11:41 c1-h7-i corosync[1106]: [QUORUM] Members[7]: 4 6 7 8 9 2 3
Sep 19 07:11:41 c1-h7-i corosync[1106]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: received sync request (epoch 2/1105/00000027)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: received sync request (epoch 2/1105/00000022)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: received all states
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: leader is 2/1105
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: synced members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: all data is up to date
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: dfsm_deliver_queue: queue length 20
Sep 19 07:11:41 c1-h7-i pve-ha-crm[1168]: loop take too long (33 seconds)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: received all states
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: all data is up to date
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: dfsm_deliver_queue: queue length 330
Sep 19 07:11:46 c1-h7-i corosync[1106]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:46 c1-h7-i corosync[1106]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:51 c1-h7-i pve-ha-lrm[1179]: loop take too long (34 seconds)
Sep 19 07:11:57 c1-h7-i corosync[1106]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:57 c1-h7-i corosync[1106]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1072) was formed. Members joined: 1
Sep 19 07:11:58 c1-h7-i corosync[1106]: [TOTEM ] A new membership (10.0.0.14:1072) was formed. Members joined: 1
Sep 19 07:11:58 c1-h7-i pmxcfs[1093]: [status] notice: members: 1/1076, 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:58 c1-h7-i pmxcfs[1093]: [status] notice: starting data syncronisation
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [QUORUM] Members[8]: 4 6 7 8 9 2 3 1
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [MAIN ] Completed service synchronization, ready to provide service.
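
In case it's useful for lining up what each node saw, a rough sketch that pulls just the membership changes out of a syslog file (it assumes timestamps in the format above and reads the log on stdin, e.g. from /var/log/syslog or journalctl output):

# Rough sketch: extract corosync membership changes from syslog to build a
# timeline of when nodes left/joined. Reads log lines on stdin.
import re
import sys

pattern = re.compile(
    r"^(?P<ts>\w+\s+\d+\s+[\d:]+).*\[TOTEM\s*\] A new membership "
    r"\((?P<ring>[\d.:]+)\) was formed\. Members"
    r"(?: left: (?P<left>[\d ]+))?(?: joined: (?P<joined>[\d ]+))?"
)

for line in sys.stdin:
    m = pattern.search(line)
    if m:
        print(f"{m.group('ts')}  ring {m.group('ring')}  "
              f"left={m.group('left') or '-'}  joined={m.group('joined') or '-'}")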
 
I can confirm it was tested and working without issue over an extended period of time.
 
What else happened in your cluster at Sep 19 07:11? And how is your corosync set up?
 
Nothing as far as I am aware. Everything was stable; that was the first node that went down, and some others followed after.

How is it set up? Could you elaborate on what information you want so I can provide it?
 
What does your /etc/corosync/corosync.conf look like?
 
Here:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: c1-h5-i
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.0.0.18
  }

  node {
    name: c1-h7-i
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.20
  }

  node {
    name: c1-h1-i
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.14
  }

  node {
    name: c1-h4-i
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.0.0.17
  }

  node {
    name: c1-h8-i
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.21
  }

  node {
    name: c1-h9-i
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.0.0.22
  }

  node {
    name: c1-h3-i
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.0.0.16
  }

  node {
    name: c1-h2-i
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.0.0.15
  }

  node {
    name: c1-h6-i
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.19
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: c1-rdg-uk
  config_version: 27
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: 10.0.0.21
    ringnumber: 0
  }
}
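
In case it helps with debugging, here's a rough sketch that checks every ring0_addr in that nodelist answers on the cluster network (it assumes the standard config path and that ping is installed):

# Rough sketch: read the ring0_addr entries out of corosync.conf and ping each
# one, to quickly spot a node that is unreachable or lossy on the cluster
# network. Assumes the standard config path and that ping is available.
import re
import subprocess

CONF = "/etc/corosync/corosync.conf"

with open(CONF) as f:
    addrs = re.findall(r"ring0_addr:\s*(\S+)", f.read())

for addr in addrs:
    # 5 pings, 1 second timeout each; a non-zero exit code means loss/unreachable
    result = subprocess.run(
        ["ping", "-c", "5", "-W", "1", "-q", addr],
        capture_output=True, text=True,
    )
    status = "ok" if result.returncode == 0 else "PROBLEM"
    print(f"{addr}: {status}")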
 
Nothing as far as I am aware. Everything was stable; that was the first node that went down, and some others followed after.
Well, something must have happened, otherwise the cluster wouldn't have fallen apart.

Is corosync traffic running on its own network or shared? And if shared, what traffic was running there when it happened (e.g. backups, Ceph)?
 
It's on its own network in the 10.x space on the second NIC.

Backups are not enabled and, looking at traffic graphs, there's nothing more than usual, a few Mbit/s.

Public traffic is on the first NIC.
 
One node went down, then another and another. Initially, we thought this was due to conntrack reporting its table was full, but that occurred on two nodes that didn't reboot (and the limit has been increased accordingly). Although that may be related?
Conntrack might have dropped the corosync traffic, and the nodes therefore fenced. The question of what filled up the conntrack table remains open.
 
I assume this could be any of the VMs as well, so any of them with a high connection count could impact the node entirely.

We had the limit set far higher than what we were using with Xen. Although the conntrack errors only appeared on two nodes that didn't restart at all.
 
If only one host were flooded with connections, that host would get fenced, as it would not answer in time for corosync. Just a thought: if you use a VM in HA (that gets a DoS) and it moves from host to host, then it could affect all nodes that are available for relocation.
 
Well, we disabled HA on all the VMs when this occurred last time so nothing had moved around this time.

c1-h5-i and c1-h9-i reported "table full, dropping packet" but didn't get fenced or reboot.

Could the other nodes reboot because either of those nodes told them to? Can you tell what triggered a node to reboot?
 
Well, we disabled HA on all the VMs when this occurred last time so nothing had moved around this time.
Then it must be something cluster wide.

c1-h5-i and c1-h9-i reported "table full, dropping packet" but didn't get fenced or reboot.
The other nodes might not have been able to log it.

Could the other nodes reboot because either of those nodes told them to? Can you tell what triggered a node to reboot?
That depends on your fencing config. The default uses a software watchdog, or IPMI if you configured it; in both scenarios, the nodes only fence themselves.
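
If the journal is persistent across reboots, the previous boot's log usually shows what the node saw just before it fenced. A rough sketch (the keyword list is only a guess) to pull out the likely lines:

# Rough sketch: scan the journal from the boot *before* the reboot for the
# usual suspects (watchdog expiry, lost quorum, fencing). Assumes a
# persistent journal so "journalctl -b -1" still has the previous boot.
import subprocess

KEYWORDS = ("watchdog", "quorum", "fence", "Members left", "token")

out = subprocess.run(
    ["journalctl", "-b", "-1", "--no-pager"],
    capture_output=True, text=True,
).stdout

for line in out.splitlines():
    if any(k.lower() in line.lower() for k in KEYWORDS):
        print(line)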
 
That still leaves some unanswered questions, and also the need for a possible solution to the issue where a single VM could saturate the connection table, causing massive instability on a cluster with nodes rebooting all over the place.

It could also be hard to stop if the nodes keep rebooting before you can identify the VM.

Maybe an ability to see connections per VM? (Something along the lines of the sketch below.)

I'd also be interested to know what conntrack limit others are using.
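
For the per-VM question, a rough sketch that gives a crude breakdown of conntrack entries per originating source IP (which on a bridged setup usually maps to a guest); it assumes /proc/net/nf_conntrack is readable as root with the nf_conntrack module loaded, and "conntrack -L" from conntrack-tools would give the same data:

# Rough sketch: count conntrack entries per originating source IP, as a crude
# "connections per VM" view. Assumes /proc/net/nf_conntrack is readable.
import re
from collections import Counter

counts = Counter()
with open("/proc/net/nf_conntrack") as f:
    for line in f:
        m = re.search(r"src=(\S+)", line)  # first src= is the original direction
        if m:
            counts[m.group(1)] += 1

for ip, n in counts.most_common(15):
    print(f"{n:8d}  {ip}")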
 
As you mentioned earlier that you have a separate link for VM traffic on your hosts, it should only happen on the one node that hosts the VM, if it is not in HA.

I guess, if a VM is the culprit, then you might get something from your monitoring.
 
Did you also take your switches/router into account? Maybe something can be seen there.
 
That's all been checked. We run an identical configuration on another cluster too.

Switch reports no errors in logs or errors on the ports. As the network is internal, there isn't any router to contend with.
 
It looks like a packet storm that is triggered by something. This calls for some intensive monitoring/analysis: ntop, tcpdump, and other tools that help to check what is actually going on around that time.
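
Even sampling the packet counters on the corosync NIC once per second can make a storm obvious. A rough sketch (the interface name is only a placeholder; adjust it to the NIC carrying the 10.0.0.x network):

# Rough sketch: sample packet/byte counters on the cluster NIC once a second
# to catch a packet storm as it happens.
import time

IFACE = "eth1"  # assumption: the second NIC carrying cluster traffic

def read_counter(name):
    with open(f"/sys/class/net/{IFACE}/statistics/{name}") as f:
        return int(f.read())

prev_pkts, prev_bytes = read_counter("rx_packets"), read_counter("rx_bytes")
while True:
    time.sleep(1)
    pkts, byts = read_counter("rx_packets"), read_counter("rx_bytes")
    print(f"{pkts - prev_pkts:8d} pkt/s  {(byts - prev_bytes) * 8 / 1e6:8.2f} Mbit/s")
    prev_pkts, prev_bytes = pkts, byts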
 
