Hello, I have a 15-node cluster with about 50 VMs/CTs running, all configured for HA and using the software watchdog. It's hyperconverged, using Ceph RBD for the VM storage. All nodes run PVE 6.4-13 and Ceph 15.2.9. The hostnames in the cluster are "node36" through "node50."
Yesterday, I removed one of the nodes from the cluster (pvecm delnode node50). Then, all nodes except node47 and node50 reset!
In the GUI, node50 reported itself as a standalone node, yet it still showed all the node and VM objects from the cluster it had left. It never fully reached standalone mode, so I don't believe the delnode procedure succeeded. After the nodes rebooted, it took about 10 minutes for all services to begin functioning again, and only then did I receive all the fencing emails.
Most importantly, it makes no sense that the uninvolved nodes lost quorum or had their watchdogs expire. Because these nodes also host Ceph OSDs, the VMs and their storage were shut down uncleanly. Many VMs are now corrupt and need to be repaired or restored.
Where do I begin with this issue? Is there a command I can send to globally disable/enable HA and fencing during a cluster join/leave procedure?
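The only workaround I can think of (and I'm not sure it's safe or supported, so please correct me) is stopping the HA services on every node before touching cluster membership, so the LRMs go idle and release the watchdog before anything can fence. Something roughly like:

```shell
# Run on EVERY node before the join/leave procedure.
# Stopping pve-ha-lrm lets the local resource manager shut down
# cleanly and close the watchdog, so the node cannot be fenced;
# pve-ha-crm stops the cluster-wide manager.
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm

# ... perform the membership change, e.g. pvecm delnode ...

# Afterwards, start the services again on every node.
systemctl start pve-ha-crm
systemctl start pve-ha-lrm
```

Is that the intended way to do it, or is there a single command that disables HA/fencing cluster-wide?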
Here are the last 30 seconds of the syslog leading up to node36 resetting:
Sep 30 15:12:20 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:21 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20
Sep 30 15:12:22 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 30
Sep 30 15:12:23 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 40
Sep 30 15:12:24 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:24] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:24 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 50
Sep 30 15:12:25 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 60
Sep 30 15:12:26 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 70
Sep 30 15:12:27 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 80
Sep 30 15:12:28 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 90
Sep 30 15:12:29 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:29] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:29 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 100
Sep 30 15:12:29 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retried 100 times
Sep 30 15:12:29 virtual36 pmxcfs[4515]: [status] crit: cpg_send_message failed: 6
Sep 30 15:12:30 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:31 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20
Sep 30 15:12:32 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 30
Sep 30 15:12:32 virtual36 corosync[4562]: [QUORUM] Sync members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Sep 30 15:12:32 virtual36 corosync[4562]: [TOTEM ] A new membership (1.9687461706e63e9) was formed. Members
Sep 30 15:12:33 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 40
Sep 30 15:12:34 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:34] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:34 virtual36 ceph-mon[4557]: 2021-09-30T15:12:34.326-0400 7ffa7df7e700 -1 mon.virtual36@0(leader) e16 get_health_metrics reporting 2 slow ops, oldest is monmgrreport(0 checks, 0 progress events)
Sep 30 15:12:34 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 50
Sep 30 15:12:35 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 60
Sep 30 15:12:36 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 70
Sep 30 15:12:37 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 80
Sep 30 15:12:38 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 90
Sep 30 15:12:39 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:39] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:39 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 100
Sep 30 15:12:39 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retried 100 times
Sep 30 15:12:39 virtual36 pmxcfs[4515]: [status] crit: cpg_send_message failed: 6
Sep 30 15:12:39 virtual36 pve-firewall[4757]: firewall update time (13.481 seconds)
Sep 30 15:12:39 virtual36 pveproxy[2484464]: problem with client ::ffff:192.168.199.232; Connection reset by peer
Sep 30 15:12:40 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:41 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20
Sep 30 15:12:42 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 30
Sep 30 15:12:43 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 40
Sep 30 15:12:44 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:44] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:44 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 50
Sep 30 15:12:45 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 60
Sep 30 15:12:45 virtual36 corosync[4562]: [QUORUM] Sync members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Sep 30 15:12:45 virtual36 corosync[4562]: [TOTEM ] A new membership (1.9687461706e63ed) was formed. Members
Sep 30 15:12:46 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 70
Sep 30 15:12:47 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 80
Sep 30 15:12:48 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 90
Sep 30 15:12:49 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:49] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:49 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 100
Sep 30 15:12:49 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retried 100 times
Sep 30 15:12:49 virtual36 pmxcfs[4515]: [status] crit: cpg_send_message failed: 6
Sep 30 15:12:49 virtual36 pve-firewall[4757]: firewall update time (10.011 seconds)
Sep 30 15:12:50 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:51 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20