Cluster Instability

Last week something happened where nodes just started rebooting each other because, I assume, they thought other nodes were down, so we removed all VMs from the cluster.

This morning, a similar thing happened.

One node went down, then another and another. Initially, we thought this was due to conntrack reporting its table was full, but that occurred on two nodes that didn't reboot (and the limit has been increased accordingly). Although that may be related?
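
For reference, a rough sketch of how the table usage could be checked against the limit (it assumes the nf_conntrack module is loaded and the usual /proc paths exist):

# Rough sketch: compare the current conntrack entry count against the limit.
# Assumes the nf_conntrack module is loaded and the standard /proc paths exist.

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")

print(f"conntrack: {count}/{limit} ({100 * count / limit:.1f}% used)")
# Raising net.netfilter.nf_conntrack_max via sysctl is how the limit was
# increased; the right value depends on RAM and workload.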

Here are some logs from the first node that rebooted.

Any suggestions on further debugging would be appreciated, as twice in a week is confusing. (The switch is set up correctly in regard to multicast/IGMP snooping.)

Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] Failed to receive the leave message. failed: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: [TOTEM ] Failed to receive the leave message. failed: 1 5
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: starting data syncronisation
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: starting data syncronisation
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [QUORUM] Members[7]: 4 6 7 8 9 2 3
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 19 07:11:41 c1-h7-i corosync[1106]: [QUORUM] Members[7]: 4 6 7 8 9 2 3
Sep 19 07:11:41 c1-h7-i corosync[1106]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: received sync request (epoch 2/1105/00000027)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: received sync request (epoch 2/1105/00000022)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: received all states
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: leader is 2/1105
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: synced members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: all data is up to date
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: dfsm_deliver_queue: queue length 20
Sep 19 07:11:41 c1-h7-i pve-ha-crm[1168]: loop take too long (33 seconds)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: received all states
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: all data is up to date
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: dfsm_deliver_queue: queue length 330
Sep 19 07:11:46 c1-h7-i corosync[1106]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:46 c1-h7-i corosync[1106]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:51 c1-h7-i pve-ha-lrm[1179]: loop take too long (34 seconds)
Sep 19 07:11:57 c1-h7-i corosync[1106]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:57 c1-h7-i corosync[1106]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1072) was formed. Members joined: 1
Sep 19 07:11:58 c1-h7-i corosync[1106]: [TOTEM ] A new membership (10.0.0.14:1072) was formed. Members joined: 1
Sep 19 07:11:58 c1-h7-i pmxcfs[1093]: [status] notice: members: 1/1076, 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:58 c1-h7-i pmxcfs[1093]: [status] notice: starting data syncronisation
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [QUORUM] Members[8]: 4 6 7 8 9 2 3 1
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [MAIN ] Completed service synchronization, ready to provide service.
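
In case it's useful for lining up what each node saw, a rough sketch that pulls just the membership changes out of a syslog file (it assumes timestamps in the format above and reads the log on stdin, e.g. from /var/log/syslog or journalctl output):

# Rough sketch: extract corosync membership changes from syslog to build a
# timeline of when nodes left/joined. Reads log lines on stdin.
import re
import sys

pattern = re.compile(
    r"^(?P<ts>\w+\s+\d+\s+[\d:]+).*\[TOTEM\s*\] A new membership "
    r"\((?P<ring>[\d.:]+)\) was formed\. Members"
    r"(?: left: (?P<left>[\d ]+))?(?: joined: (?P<joined>[\d ]+))?"
)

for line in sys.stdin:
    m = pattern.search(line)
    if m:
        print(f"{m.group('ts')}  ring {m.group('ring')}  "
              f"left={m.group('left') or '-'}  joined={m.group('joined') or '-'}")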
 
I can confirm it was tested and working without issue over an extended period of time.
 
What else happened in your cluster at Sep 19 07:11? And how is your corosync set up?
 
Nothing as far as I am aware. Everything was stable; that was the first node that went down, and some others followed after.

How is it set up? Could you elaborate on what information you want so I can provide it?
 
What does your /etc/corosync/corosync.conf look like?
 
Here:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: c1-h5-i
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.0.0.18
  }

  node {
    name: c1-h7-i
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.0.20
  }

  node {
    name: c1-h1-i
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.0.14
  }

  node {
    name: c1-h4-i
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.0.0.17
  }

  node {
    name: c1-h8-i
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.21
  }

  node {
    name: c1-h9-i
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.0.0.22
  }

  node {
    name: c1-h3-i
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.0.0.16
  }

  node {
    name: c1-h2-i
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.0.0.15
  }

  node {
    name: c1-h6-i
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.19
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: c1-rdg-uk
  config_version: 27
  ip_version: ipv4
  secauth: on
  version: 2

  interface {
    bindnetaddr: 10.0.0.21
    ringnumber: 0
  }
}
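
In case it helps with debugging, here's a rough sketch that checks every ring0_addr in that nodelist answers on the cluster network (it assumes the standard config path and that ping is installed):

# Rough sketch: read the ring0_addr entries out of corosync.conf and ping each
# one, to quickly spot a node that is unreachable or lossy on the cluster
# network. Assumes the standard config path and that ping is available.
import re
import subprocess

CONF = "/etc/corosync/corosync.conf"

with open(CONF) as f:
    addrs = re.findall(r"ring0_addr:\s*(\S+)", f.read())

for addr in addrs:
    # 5 pings, 1 second timeout each; a non-zero exit code means loss/unreachable
    result = subprocess.run(
        ["ping", "-c", "5", "-W", "1", "-q", addr],
        capture_output=True, text=True,
    )
    status = "ok" if result.returncode == 0 else "PROBLEM"
    print(f"{addr}: {status}")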
 
Nothing as far as I am aware. Everything was stable; that was the first node that went down, and some others followed after.
Well, something must have happened, otherwise the cluster wouldn't have fallen apart.

Is corosync traffic running on its own network or shared? And if shared, what traffic was running there when it happened (e.g. backups, Ceph)?
 
It's on its own network in the 10.x space on the second NIC.

Backups are not enabled and, looking at traffic graphs, there's nothing more than usual, a few Mbit/s.

Public traffic is on the first NIC.
 
One node went down, then another and another. Initially, we thought this was due to conntrack reporting its table was full, but that occurred on two nodes that didn't reboot (and the limit has been increased accordingly). Although that may be related?
Conntrack might have dropped the corosync traffic, and the nodes therefore fenced. The question of what filled up the conntrack table remains open.
 
I assume this could be any of the VMs as well, so any of them with a high connection count could impact the node entirely.

We had the limit set far higher than what we were using with Xen. Although the conntrack errors only appeared on two nodes that didn't restart at all.
 
If only one host were flooded with connections, that host would get fenced, as it would not answer in time for corosync. Just a thought: if you use a VM in HA (that gets a DoS) and it moves from host to host, then it could affect all nodes that are available for relocation.
 
Well, we disabled HA on all the VMs when this occurred last time so nothing had moved around this time.

c1-h5-i and c1-h9-i reported "table full, dropping packet" but didn't get fenced or reboot.

Could the other nodes reboot because either of those nodes told them to? Can you tell what triggered a node to reboot?
 
Well, we disabled HA on all the VMs when this occurred last time so nothing had moved around this time.
Then it must be something cluster wide.

c1-h5-i and c1-h9-i reported "table full, dropping packet" but didn't get fenced or reboot.
The other nodes might not have been able to log it.

Could the other nodes reboot because either of those nodes told them to? Can you tell what triggered a node to reboot?
That depends on your fencing config. The default uses a software watchdog, or IPMI if you configured it; in both scenarios, the nodes only fence themselves.
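
If the journal is persistent across reboots, the previous boot's log usually shows what the node saw just before it fenced. A rough sketch (the keyword list is only a guess) to pull out the likely lines:

# Rough sketch: scan the journal from the boot *before* the reboot for the
# usual suspects (watchdog expiry, lost quorum, fencing). Assumes a
# persistent journal so "journalctl -b -1" still has the previous boot.
import subprocess

KEYWORDS = ("watchdog", "quorum", "fence", "Members left", "token")

out = subprocess.run(
    ["journalctl", "-b", "-1", "--no-pager"],
    capture_output=True, text=True,
).stdout

for line in out.splitlines():
    if any(k.lower() in line.lower() for k in KEYWORDS):
        print(line)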
 
That still leaves some unanswered questions, and also the need for a possible solution to the issue where a single VM could saturate the connection table, causing massive instability on a cluster with nodes rebooting all over the place.

It could also be hard to stop if the nodes keep rebooting before you can identify the VM.

Maybe an ability to see connections per VM? (Something along the lines of the sketch below.)

I'd also be interested to know what conntrack limit others are using.
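
For the per-VM question, a rough sketch that gives a crude breakdown of conntrack entries per originating source IP (which on a bridged setup usually maps to a guest); it assumes /proc/net/nf_conntrack is readable as root with the nf_conntrack module loaded, and "conntrack -L" from conntrack-tools would give the same data:

# Rough sketch: count conntrack entries per originating source IP, as a crude
# "connections per VM" view. Assumes /proc/net/nf_conntrack is readable.
import re
from collections import Counter

counts = Counter()
with open("/proc/net/nf_conntrack") as f:
    for line in f:
        m = re.search(r"src=(\S+)", line)  # first src= is the original direction
        if m:
            counts[m.group(1)] += 1

for ip, n in counts.most_common(15):
    print(f"{n:8d}  {ip}")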
 
As you mentioned earlier that you have a separate link for VM traffic on your hosts, it should only happen on the one node that hosts the VM, if it is not in HA.

I guess, if a VM is the culprit, then you might get something from your monitoring.
 
Did you also take your switches/router into account? Maybe something can be seen there.
 
That's all been checked. We run an identical configuration on another cluster too.

Switch reports no errors in logs or errors on the ports. As the network is internal, there isn't any router to contend with.
 
It looks like a packet storm that is triggered by something. This calls for some intensive monitoring/analysis: ntop, tcpdump, and other tools that help to check what is actually going on around that time.
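
Even sampling the packet counters on the corosync NIC once per second can make a storm obvious. A rough sketch (the interface name is only a placeholder; adjust it to the NIC carrying the 10.0.0.x network):

# Rough sketch: sample packet/byte counters on the cluster NIC once a second
# to catch a packet storm as it happens.
import time

IFACE = "eth1"  # assumption: the second NIC carrying cluster traffic

def read_counter(name):
    with open(f"/sys/class/net/{IFACE}/statistics/{name}") as f:
        return int(f.read())

prev_pkts, prev_bytes = read_counter("rx_packets"), read_counter("rx_bytes")
while True:
    time.sleep(1)
    pkts, byts = read_counter("rx_packets"), read_counter("rx_bytes")
    print(f"{pkts - prev_pkts:8d} pkt/s  {(byts - prev_bytes) * 8 / 1e6:8.2f} Mbit/s")
    prev_pkts, prev_bytes = pkts, byts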
 
