Hi all, we are facing a problem with HA behavior. The other day we lost network connectivity to one of our three datacenters, which hosts a third of the hypervisors in the cluster (the cluster consists of 37 hypervisors in total). As expected, the disconnected third of the hypervisors tried to form a quorum on its own and failed,
while the remaining two thirds of the cluster successfully formed a quorum. But for some inexplicable reason, HA then decided to reboot all nodes of the newly assembled quorate cluster and only afterwards continued to function normally, which caused an inevitable crash of all running VMs in the entire cluster for about ~7 minutes.
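For reference, the quorum math itself adds up: with 37 votes the majority is 19, so the 23-node partition should be quorate and the 14-node partition should not. A quick way to double-check that on any node (standard PVE/corosync tooling, nothing custom assumed):
Code:
# cluster membership and quorum summary as seen by this node
pvecm status

# corosync's own quorum view (expected votes, total votes, quorum threshold)
corosync-quorumtool -s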
Below are the corosync/pmxcfs logs, first from a node in the minority partition (sd01):
Code:
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]: [QUORUM] Sync members[14]: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]: [QUORUM] Sync left[23]: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]: [TOTEM ] A new membership (4.5c3a) was formed. Members left: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]: [TOTEM ] Failed to receive the leave message. failed: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: members: 4/8686, 11/9484, 12/8262, 13/339019, 17/8986, 18/9005, 19/9035, 22/8632, 23/8125, 27/8068, 28/8339, 31/8036, 32/11695, 33/11671
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: members: 4/8686, 11/9484, 12/8262, 13/339019, 17/8986, 18/9005, 19/9035, 22/8632, 23/8125, 27/8068, 28/8339, 31/8036, 32/11695, 33/11671
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]: [QUORUM] Members[14]: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: node lost quorum
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: received sync request (epoch 4/8686/0000000A)
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: received sync request (epoch 4/8686/0000000A)
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: received all states
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: leader is 4/8686
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: synced members: 4/8686, 11/9484, 12/8262, 13/339019, 17/8986, 18/9005, 19/9035, 22/8632, 23/8125, 27/8068, 28/8339, 31/8036, 32/11695, 33/11671
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: all data is up to date
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: dfsm_deliver_queue: queue length 56
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: received write while not quorate - trigger resync
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: leaving CPG group
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: received all states
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: all data is up to date
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: dfsm_deliver_queue: queue length 162
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: start cluster connection
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: cpg_join failed: 14
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: can't initialize service
Mar 27 23:51:17 avi-proxmox-infra-sd01 pve-ha-lrm[22747]: lost lock 'ha_agent_avi-proxmox-infra-sd01_lock - cfs lock update failed - Device or resource busy
Mar 27 23:51:17 avi-proxmox-infra-sd01 pve-ha-lrm[22747]: status change active => lost_agent_lock
Mar 27 23:51:17 avi-proxmox-infra-sd01 pvescheduler[546533]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 27 23:51:17 avi-proxmox-infra-sd01 pvescheduler[546532]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
And the same event as seen from a node in the majority partition (ix01):
Code:
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]: [QUORUM] Sync members[23]: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]: [QUORUM] Sync left[14]: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]: [TOTEM ] A new membership (1.5c3a) was formed. Members left: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]: [TOTEM ] Failed to receive the leave message. failed: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: members: 1/8280, 2/312011, 3/8268, 5/7515, 6/7216, 7/7306, 8/8102, 9/8279, 10/8268, 14/7314, 15/7211, 16/7201, 20/8395, 21/8894, 24/8406, 25/8160, 26/8474, 29/8163, 30/8014, 34/11715, 35/11616, 36/10240, 37/11535
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [status] notice: members: 1/8280, 2/312011, 3/8268, 5/7515, 6/7216, 7/7306, 8/8102, 9/8279, 10/8268, 14/7314, 15/7211, 16/7201, 20/8395, 21/8894, 24/8406, 25/8160, 26/8474, 29/8163, 30/8014, 34/11715, 35/11616, 36/10240, 37/11535
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [status] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]: [QUORUM] Members[23]: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]: [MAIN ] Completed service synchronization, ready to provide service.
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: received sync request (epoch 1/8280/00000025)
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [status] notice: received sync request (epoch 1/8280/00000025)
Mar 27 23:51:17 avi-proxmox-infra-ix01 collectd[651285]: ntpoffset plugin: failed to read offset from 10.160.82.31
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: received all states
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: leader is 1/8280
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: synced members: 1/8280, 2/312011, 3/8268, 5/7515, 6/7216, 7/7306, 8/8102, 9/8279, 10/8268, 14/7314, 15/7211, 16/7201, 20/8395, 21/8894, 24/8406, 25/8160, 26/8474, 29/8163, 30/8014, 34/11715, 35/11616, 36/10240, 37/11535
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: all data is up to date
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: dfsm_deliver_queue: queue length 91
Probably this behavior was triggered by pve-ha-crm.service, but it's not quite clear why. I'm familiar with a similar issue in the tracker,
but I do not fully understand how to avoid this HA behavior, because a failure is not planned work that you can prepare for in advance, so running something like the following beforehand is useless here:
Code:
# put every HA-managed service into the ignored state before planned work
for service in $(ha-manager status | grep service | awk '{print $2}'); do ha-manager set "$service" --state ignored; done
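The matching step after the planned work would presumably be the same loop with --state started (a sketch only, with the same assumptions about the ha-manager status output as above):
Code:
# revert: hand the services back to the HA stack after planned maintenance
for service in $(ha-manager status | grep service | awk '{print $2}'); do ha-manager set "$service" --state started; done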
I'll also attach some data about my PVE configuration:
Code:
sudo corosync-cfgtool -s
Local node ID 20, transport knet
LINK ID 0 udp
addr = 10.208.64.12
status:
nodeid: 1: connected
nodeid: 2: connected
nodeid: 3: connected
nodeid: 4: connected
nodeid: 5: connected
nodeid: 6: connected
nodeid: 7: connected
nodeid: 8: connected
nodeid: 9: connected
nodeid: 10: connected
nodeid: 11: connected
nodeid: 12: connected
nodeid: 13: connected
nodeid: 14: connected
nodeid: 15: connected
nodeid: 16: connected
nodeid: 17: connected
nodeid: 18: connected
nodeid: 19: connected
nodeid: 20: localhost
nodeid: 21: connected
nodeid: 22: connected
nodeid: 23: connected
nodeid: 24: connected
nodeid: 25: connected
nodeid: 26: connected
nodeid: 27: connected
nodeid: 28: connected
nodeid: 29: connected
nodeid: 30: connected
nodeid: 31: connected
nodeid: 32: connected
nodeid: 33: connected
nodeid: 34: connected
nodeid: 35: connected
nodeid: 36: connected
nodeid: 37: connected