Last week the nodes in our cluster started rebooting each other, I assume because they thought other nodes were down, so we removed all VMs from the cluster.
This morning, a similar thing happened.
One node went down, then another and another. Initially, we thought this was due to conntrack reporting that it was full, but that occurred on two nodes that didn't reboot (and the limit has since been increased accordingly). It may still be related, though.
Here are some logs from the first node that rebooted.
Any suggestions for further debugging would be appreciated, as twice in a week is confusing. (The switch is set correctly with regard to multicast/IGMP snooping.)
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] Failed to receive the leave message. failed: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5
Sep 19 07:11:41 c1-h7-i corosync[1106]: [TOTEM ] Failed to receive the leave message. failed: 1 5
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: starting data syncronisation
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: starting data syncronisation
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [QUORUM] Members[7]: 4 6 7 8 9 2 3
Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 19 07:11:41 c1-h7-i corosync[1106]: [QUORUM] Members[7]: 4 6 7 8 9 2 3
Sep 19 07:11:41 c1-h7-i corosync[1106]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: received sync request (epoch 2/1105/00000027)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: received sync request (epoch 2/1105/00000022)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: received all states
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: leader is 2/1105
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: synced members: 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: all data is up to date
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: dfsm_deliver_queue: queue length 20
Sep 19 07:11:41 c1-h7-i pve-ha-crm[1168]: loop take too long (33 seconds)
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: received all states
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: all data is up to date
Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [status] notice: dfsm_deliver_queue: queue length 330
Sep 19 07:11:46 c1-h7-i corosync[1106]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:46 c1-h7-i corosync[1106]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:51 c1-h7-i pve-ha-lrm[1179]: loop take too long (34 seconds)
Sep 19 07:11:57 c1-h7-i corosync[1106]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:57 c1-h7-i corosync[1106]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1072) was formed. Members joined: 1
Sep 19 07:11:58 c1-h7-i corosync[1106]: [TOTEM ] A new membership (10.0.0.14:1072) was formed. Members joined: 1
Sep 19 07:11:58 c1-h7-i pmxcfs[1093]: [status] notice: members: 1/1076, 2/1105, 3/1093, 4/1038, 6/1062, 7/1045, 8/1068, 9/1049
Sep 19 07:11:58 c1-h7-i pmxcfs[1093]: [status] notice: starting data syncronisation
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [QUORUM] Members[8]: 4 6 7 8 9 2 3 1
Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [MAIN ] Completed service synchronization, ready to provide service.
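In case it helps with comparing nodes, here is a minimal sketch of how the membership changes in a syslog excerpt like the one above could be reduced to a short timeline. It is only an illustration: the function name `membership_changes` and the regular expressions are my own, they assume the exact line format shown in these logs, and they collapse the doubled corosync lines (with and without the `notice` prefix) into one event.

```python
import re

# Assumed line format (taken from the excerpt above), e.g.:
#   Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership
#   (10.0.0.14:1056) was formed. Members left: 1 5
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+corosync\[\d+\]:.*"
    r"A new membership \((?P<ring>[^)]+)\) was formed\.(?P<rest>.*)$"
)
# Picks up "joined: 1" and/or "left: 1 5" from the tail of the line.
CHANGE_RE = re.compile(r"(joined|left):((?:\s+\d+)+)")


def membership_changes(lines):
    """Return de-duplicated (timestamp, ring, change, node_ids) tuples in log order."""
    seen = set()
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a corosync membership line
        for change, ids in CHANGE_RE.findall(match.group("rest")):
            nodes = tuple(int(n) for n in ids.split())
            event = (match.group("ts"), match.group("ring"), change, nodes)
            if event not in seen:  # corosync logs each message twice here
                seen.add(event)
                events.append(event)
    return events


SAMPLE = [
    "Sep 19 07:11:41 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5",
    "Sep 19 07:11:41 c1-h7-i corosync[1106]: [TOTEM ] A new membership (10.0.0.14:1056) was formed. Members left: 1 5",
    "Sep 19 07:11:41 c1-h7-i pmxcfs[1093]: [dcdb] notice: starting data syncronisation",
    "Sep 19 07:11:58 c1-h7-i corosync[1106]: notice [TOTEM ] A new membership (10.0.0.14:1072) was formed. Members joined: 1",
]

for ts, ring, change, nodes in membership_changes(SAMPLE):
    print(ts, ring, change, nodes)
```

Running the same function over the syslog of each node should make it easier to see which node dropped out first and whether the other nodes agree on the order.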