Lost quorum and all nodes were reset during a disjoin!

alyarb

Well-Known Member
Feb 11, 2020
Hello, I have a 15-node cluster with about 50 VMs/CTs running, all configured for HA and using the software watchdog. It is hyperconverged, using Ceph RBD for the VM storage. All nodes run PVE 6.4-13 and Ceph 15.2.9. The hostnames in the cluster are "node36" through "node50."

Yesterday, I removed one of the nodes from the cluster (pvecm delnode node50). Then, all nodes except node47 and node50 reset!

node50's GUI reported it as a standalone node, yet it still listed all the other nodes and VMs from the cluster it had supposedly left. It never fully reached standalone mode, and I do not believe the delnode procedure succeeded. After the nodes rebooted, it took about 10 minutes for all services to start functioning again, and only then did I receive all the fencing emails.
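For reference, this is how I checked the leftover state on node50 (standard PVE tooling; I am assuming the stale node directories are what pmxcfs keeps under /etc/pve/nodes):

pvecm status              # what corosync still believes about membership
ls /etc/pve/nodes         # stale per-node directories left over from the cluster
cat /etc/pve/corosync.conf   # whether the old nodelist is still present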

Most importantly, it makes no sense that 14 uninvolved nodes lost quorum or let their watchdogs expire. Because these nodes also carry Ceph OSDs, the resets meant an unclean shutdown of both the VMs and their storage. Many VMs are now corrupt and need to be repaired or restored.
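For the recovery I am starting with the usual Ceph health checks before touching individual guests (standard tooling; the pool name below is a placeholder for ours):

ceph -s                   # overall cluster state after the resets
ceph health detail        # any inconsistent or unfound objects
rbd ls <pool>             # enumerate images on the RBD pool (substitute the actual pool name)
# filesystem repair then has to happen inside each guest, e.g. fsck from a rescue environment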

Where do I begin with this issue? Is there a command I can send to globally disable/enable HA and fencing during a cluster join/leave procedure?
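If there is no single switch, would something along these lines be the accepted workaround? My understanding is that a gracefully stopped LRM releases the watchdog, but I would like confirmation before relying on it:

ha-manager set vm:100 --state ignored   # per HA resource; vm:100 is just an example SID
# or, on every node (LRM first, then CRM), stop the HA stack so the watchdog is released:
systemctl stop pve-ha-lrm
systemctl stop pve-ha-crm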

Here are the last 30 seconds of the syslog leading up to node36 resetting:

Sep 30 15:12:20 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:21 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20
Sep 30 15:12:22 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 30
Sep 30 15:12:23 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 40
Sep 30 15:12:24 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:24] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:24 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 50
Sep 30 15:12:25 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 60
Sep 30 15:12:26 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 70
Sep 30 15:12:27 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 80
Sep 30 15:12:28 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 90
Sep 30 15:12:29 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:29] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:29 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 100
Sep 30 15:12:29 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retried 100 times
Sep 30 15:12:29 virtual36 pmxcfs[4515]: [status] crit: cpg_send_message failed: 6
Sep 30 15:12:30 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:31 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20
Sep 30 15:12:32 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 30
Sep 30 15:12:32 virtual36 corosync[4562]: [QUORUM] Sync members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Sep 30 15:12:32 virtual36 corosync[4562]: [TOTEM ] A new membership (1.9687461706e63e9) was formed. Members
Sep 30 15:12:33 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 40
Sep 30 15:12:34 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:34] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:34 virtual36 ceph-mon[4557]: 2021-09-30T15:12:34.326-0400 7ffa7df7e700 -1 mon.virtual36@0(leader) e16 get_health_metrics reporting 2 slow ops, oldest is monmgrreport(0 checks, 0 progress events)
Sep 30 15:12:34 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 50
Sep 30 15:12:35 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 60
Sep 30 15:12:36 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 70
Sep 30 15:12:37 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 80
Sep 30 15:12:38 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 90
Sep 30 15:12:39 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:39] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:39 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 100
Sep 30 15:12:39 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retried 100 times
Sep 30 15:12:39 virtual36 pmxcfs[4515]: [status] crit: cpg_send_message failed: 6
Sep 30 15:12:39 virtual36 pve-firewall[4757]: firewall update time (13.481 seconds)
Sep 30 15:12:39 virtual36 pveproxy[2484464]: problem with client ::ffff:192.168.199.232; Connection reset by peer
Sep 30 15:12:40 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:41 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20
Sep 30 15:12:42 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 30
Sep 30 15:12:43 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 40
Sep 30 15:12:44 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:44] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:44 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 50
Sep 30 15:12:45 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 60
Sep 30 15:12:45 virtual36 corosync[4562]: [QUORUM] Sync members[14]: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Sep 30 15:12:45 virtual36 corosync[4562]: [TOTEM ] A new membership (1.9687461706e63ed) was formed. Members
Sep 30 15:12:46 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 70
Sep 30 15:12:47 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 80
Sep 30 15:12:48 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 90
Sep 30 15:12:49 virtual36 ceph-mgr[2671754]: 192.168.199.8 - - [30/Sep/2021:15:12:49] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.22.0"
Sep 30 15:12:49 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 100
Sep 30 15:12:49 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retried 100 times
Sep 30 15:12:49 virtual36 pmxcfs[4515]: [status] crit: cpg_send_message failed: 6
Sep 30 15:12:49 virtual36 pve-firewall[4757]: firewall update time (10.011 seconds)
Sep 30 15:12:50 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 10
Sep 30 15:12:51 virtual36 pmxcfs[4515]: [status] notice: cpg_send_message retry 20
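In the meantime, this is what I plan to collect from every node for anyone willing to dig in (standard corosync/PVE tooling, nothing custom):

pvecm status                        # current quorum/membership view
corosync-cfgtool -s                 # link status per ring
corosync-quorumtool -s              # quorum details
journalctl -u corosync -u pve-cluster --since "2021-09-30 15:10"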
 
alyarb said:
Yesterday, I removed one of the nodes from the cluster (pvecm delnode node50). Then, all nodes except node47 and node50 reset!

Why would you expect node50 to be reset? Did you follow the recommended procedure from our documentation? Note especially this part:

At this point you must power off [node50] and make sure that it will not power on again (in the network) as it is. [...] If you power on the node as it is, your cluster will be screwed up and it could be difficult to restore a clean cluster state.
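In short, the documented sequence looks roughly like this (VMIDs are examples; the admin guide is authoritative):

qm migrate 100 node36     # move each VM off node50 first (100 is an example VMID)
pct migrate 200 node36    # same for each container (200 is an example CTID)
# power node50 off and make sure it cannot come back on the network as it is
pvecm nodes               # from a remaining node: confirm membership and note the node ID
pvecm delnode node50      # only now remove it from the cluster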
 
