Hello,
Our problems started this week after we tried to upgrade one node in a 4-node cluster from 4.4 to 5.0.
The upgrade itself went fine, but after the first reboot the whole cluster went offline due to fencing.
This repeated every single time we tried to bring this node back online.
So I removed the offending node from the cluster, reinstalled 5.0 from scratch and rejoined it. That went fine: all nodes up, quorum OK. Until this node was rebooted.
It's all fine while it's down (the 3 remaining nodes keep quorum with 1 node down). As soon as this node comes back up, all nodes fence themselves.
This is from the log; the offending node is 172.20.10.3. As soon as it joins, woe begins:
Aug 03 10:43:32 nthl12 corosync[4913]: [TOTEM ] A new membership (172.20.10.3:2036) was formed. Members joined: 3
Aug 03 10:43:39 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 10
Aug 03 10:43:40 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 20
Aug 03 10:43:41 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 30
Aug 03 10:43:42 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 40
Aug 03 10:43:43 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 50
Aug 03 10:43:44 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 60
Aug 03 10:43:45 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 70
Aug 03 10:43:46 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 80
Aug 03 10:43:47 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 90
Aug 03 10:43:48 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 100
Aug 03 10:43:48 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retried 100 times
Aug 03 10:43:48 nthl12 pmxcfs[4698]: [status] crit: cpg_send_message failed: 6
Aug 03 10:43:48 nthl12 pve-firewall[4917]: firewall update time (8.625 seconds)
Aug 03 10:43:49 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 10
Aug 03 10:43:50 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 20
Aug 03 10:43:51 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 30
Aug 03 10:43:52 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 40
Aug 03 10:43:53 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 50
Aug 03 10:43:54 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 60
Aug 03 10:43:55 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 70
Aug 03 10:43:56 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 80
Aug 03 10:43:57 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 90
Aug 03 10:43:58 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 100
Aug 03 10:43:58 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retried 100 times
Aug 03 10:43:58 nthl12 pmxcfs[4698]: [status] crit: cpg_send_message failed: 6
Aug 03 10:43:58 nthl12 pve-firewall[4917]: firewall update time (9.010 seconds)
Aug 03 10:43:59 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 10
Aug 03 10:44:00 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 20
Aug 03 10:44:01 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 30
Aug 03 10:44:02 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 40
Aug 03 10:44:03 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 50
Aug 03 10:44:04 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 60
Aug 03 10:44:05 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 70
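To make the timing in the excerpt above easier to read, here is a rough Python sketch that summarises such a log: how long after the membership change the first cpg_send_message retry shows up, and how many sends fail outright. The names `LINE_RE` and `summarize` are made up for illustration. The sketch assumes lines in exactly the format pasted above, English month abbreviations, and a placeholder year (the log lines carry none), so treat it as a starting point rather than a tested tool.

```python
import re
from datetime import datetime

# Matches lines like:
#   Aug 03 10:43:39 nthl12 pmxcfs[4698]: [status] notice: cpg_send_message retry 10
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ [\w-]+\[\d+\]: (?P<msg>.*)$"
)

# Assumption: the log has no year, so a placeholder is used for parsing only.
PLACEHOLDER_YEAR = "2000"


def summarize(lines):
    """Return the delay between the first join and the first retry, plus failure count."""
    joined_at = None
    first_retry_at = None
    failed_sends = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not in the expected format
        ts = datetime.strptime(
            PLACEHOLDER_YEAR + " " + m.group("ts"), "%Y %b %d %H:%M:%S"
        )
        msg = m.group("msg")
        if "Members joined:" in msg:
            if joined_at is None:
                joined_at = ts
        elif "cpg_send_message retry" in msg:
            if first_retry_at is None:
                first_retry_at = ts
        elif "cpg_send_message failed" in msg:
            failed_sends += 1
    delay = None
    if joined_at is not None and first_retry_at is not None:
        delay = (first_retry_at - joined_at).total_seconds()
    return {"seconds_until_first_retry": delay, "failed_sends": failed_sends}


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < saved_log.txt
    print(summarize(sys.stdin))
```

Fed the excerpt above, this reports 7 seconds between node 3 joining and the first retry, and two failed sends (10:43:48 and 10:43:58).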
How can I troubleshoot this, and what should I do?