HA's decision to reboot all hypervisors of a cluster that re-formed after 1/3 of the hypervisors failed.

yayupokrepov

Member
Apr 5, 2023
Hi all, we are facing a problem with HA behavior. The other day we lost network connectivity to one of the three datacenters, which hosted 1/3 of the hypervisors of the cluster (the cluster itself consists of 37 hypervisors). As expected, the group of failed hypervisors (1/3 of the cluster) tried to form a quorum, but did not succeed:
Code:
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]:   [QUORUM] Sync members[14]: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]:   [QUORUM] Sync left[23]: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]:   [TOTEM ] A new membership (4.5c3a) was formed. Members left: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: members: 4/8686, 11/9484, 12/8262, 13/339019, 17/8986, 18/9005, 19/9035, 22/8632, 23/8125, 27/8068, 28/8339, 31/8036, 32/11695, 33/11671
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: members: 4/8686, 11/9484, 12/8262, 13/339019, 17/8986, 18/9005, 19/9035, 22/8632, 23/8125, 27/8068, 28/8339, 31/8036, 32/11695, 33/11671
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]:   [QUORUM] Members[14]: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-sd01 corosync[323214]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: node lost quorum
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: received sync request (epoch 4/8686/0000000A)
Mar 27 23:51:16 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: received sync request (epoch 4/8686/0000000A)
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: received all states
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: leader is 4/8686
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: synced members: 4/8686, 11/9484, 12/8262, 13/339019, 17/8986, 18/9005, 19/9035, 22/8632, 23/8125, 27/8068, 28/8339, 31/8036, 32/11695, 33/11671
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: all data is up to date
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: dfsm_deliver_queue: queue length 56
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: received write while not quorate - trigger resync
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: leaving CPG group
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: received all states
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: all data is up to date
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [status] notice: dfsm_deliver_queue: queue length 162
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] notice: start cluster connection
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: cpg_join failed: 14
Mar 27 23:51:17 avi-proxmox-infra-sd01 pmxcfs[9484]: [dcdb] crit: can't initialize service
Mar 27 23:51:17 avi-proxmox-infra-sd01 pve-ha-lrm[22747]: lost lock 'ha_agent_avi-proxmox-infra-sd01_lock - cfs lock update failed - Device or resource busy
Mar 27 23:51:17 avi-proxmox-infra-sd01 pve-ha-lrm[22747]: status change active => lost_agent_lock
Mar 27 23:51:17 avi-proxmox-infra-sd01 pvescheduler[546533]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Mar 27 23:51:17 avi-proxmox-infra-sd01 pvescheduler[546532]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
The remaining 2/3 of the cluster successfully formed a quorum, but for some inexplicable reason HA decided to reboot all nodes of this newly formed quorate partition and only then continued to function normally, which caused an unavoidable crash of all running VMs in the entire cluster for about ~7 minutes:
Code:
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]:   [QUORUM] Sync members[23]: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]:   [QUORUM] Sync left[14]: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]:   [TOTEM ] A new membership (1.5c3a) was formed. Members left: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]:   [TOTEM ] Failed to receive the leave message. failed: 4 11 12 13 17 18 19 22 23 27 28 31 32 33
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: members: 1/8280, 2/312011, 3/8268, 5/7515, 6/7216, 7/7306, 8/8102, 9/8279, 10/8268, 14/7314, 15/7211, 16/7201, 20/8395, 21/8894, 24/8406, 25/8160, 26/8474, 29/8163, 30/8014, 34/11715, 35/11616, 36/10240, 37/11535
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [status] notice: members: 1/8280, 2/312011, 3/8268, 5/7515, 6/7216, 7/7306, 8/8102, 9/8279, 10/8268, 14/7314, 15/7211, 16/7201, 20/8395, 21/8894, 24/8406, 25/8160, 26/8474, 29/8163, 30/8014, 34/11715, 35/11616, 36/10240, 37/11535
Mar 27 23:51:16 avi-proxmox-infra-ix01 pmxcfs[8395]: [status] notice: starting data syncronisation
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]:   [QUORUM] Members[23]: 1 2 3 5 6 7 8 9 10 14 15 16 20 21 24 25 26 29 30 34 35 36 37
Mar 27 23:51:16 avi-proxmox-infra-ix01 corosync[188611]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: received sync request (epoch 1/8280/00000025)
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [status] notice: received sync request (epoch 1/8280/00000025)
Mar 27 23:51:17 avi-proxmox-infra-ix01 collectd[651285]: ntpoffset plugin: failed to read offset from 10.160.82.31
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: received all states
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: leader is 1/8280
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: synced members: 1/8280, 2/312011, 3/8268, 5/7515, 6/7216, 7/7306, 8/8102, 9/8279, 10/8268, 14/7314, 15/7211, 16/7201, 20/8395, 21/8894, 24/8406, 25/8160, 26/8474, 29/8163, 30/8014, 34/11715, 35/11616, 36/10240, 37/11535
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: all data is up to date
Mar 27 23:51:17 avi-proxmox-infra-ix01 pmxcfs[8395]: [dcdb] notice: dfsm_deliver_queue: queue length 91

This behavior was probably triggered by pve-ha-crm.service, but it is not quite clear why. I am familiar with a similar issue in the tracker, but I do not fully understand how to avoid this HA behavior, because failures do not happen on a schedule, so preparing for them in advance with something like
Code:
for service in $(ha-manager status | grep service | awk '{print $2}'); do ha-manager set $service --state ignored; done
is of no use here.

I'll attach some data about my PVE configuration

Code:
sudo corosync-cfgtool -s
Local node ID 20, transport knet
LINK ID 0 udp
    addr    = 10.208.64.12
    status:
        nodeid:          1:    connected
        nodeid:          2:    connected
        nodeid:          3:    connected
        nodeid:          4:    connected
        nodeid:          5:    connected
        nodeid:          6:    connected
        nodeid:          7:    connected
        nodeid:          8:    connected
        nodeid:          9:    connected
        nodeid:         10:    connected
        nodeid:         11:    connected
        nodeid:         12:    connected
        nodeid:         13:    connected
        nodeid:         14:    connected
        nodeid:         15:    connected
        nodeid:         16:    connected
        nodeid:         17:    connected
        nodeid:         18:    connected
        nodeid:         19:    connected
        nodeid:         20:    localhost
        nodeid:         21:    connected
        nodeid:         22:    connected
        nodeid:         23:    connected
        nodeid:         24:    connected
        nodeid:         25:    connected
        nodeid:         26:    connected
        nodeid:         27:    connected
        nodeid:         28:    connected
        nodeid:         29:    connected
        nodeid:         30:    connected
        nodeid:         31:    connected
        nodeid:         32:    connected
        nodeid:         33:    connected
        nodeid:         34:    connected
        nodeid:         35:    connected
        nodeid:         36:    connected
        nodeid:         37:    connected
 
you mention one of three datacenters - what do your network setup and corosync config look like? please also check the logs on the other nodes that were part of the "bigger" part of the cluster..
 
the corosync configuration looks like this:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: avi-proxmox-infra-ef01
    nodeid: 10
    quorum_votes: 1
    ring0_addr: 10.104.66.30
  }
  node {
    name: avi-proxmox-infra-ef02
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 10.96.65.168
  }
  node {
    name: avi-proxmox-infra-ef03
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.104.66.173
  }
  node {
    name: avi-proxmox-infra-ef04
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.96.65.212
  }
  node {
    name: avi-proxmox-infra-ef05
    nodeid: 14
    quorum_votes: 1
    ring0_addr: 10.88.64.199
  }
  node {
    name: avi-proxmox-infra-ef06
    nodeid: 15
    quorum_votes: 1
    ring0_addr: 10.88.64.209
  }
  node {
    name: avi-proxmox-infra-ef07
    nodeid: 16
    quorum_votes: 1
    ring0_addr: 10.88.65.8
  }
  node {
    name: avi-proxmox-infra-ef08
    nodeid: 21
    quorum_votes: 1
    ring0_addr: 10.88.65.149
  }
  node {
    name: avi-proxmox-infra-ef09
    nodeid: 24
    quorum_votes: 1
    ring0_addr: 10.112.66.70
  }
  node {
    name: avi-proxmox-infra-ef10
    nodeid: 30
    quorum_votes: 1
    ring0_addr: 10.120.64.5
  }
  node {
    name: avi-proxmox-infra-ef11
    nodeid: 36
    quorum_votes: 1
    ring0_addr: 10.120.5.166
  }
  node {
    name: avi-proxmox-infra-ef12
    nodeid: 37
    quorum_votes: 1
    ring0_addr: 10.120.5.170
  }
  node {
    name: avi-proxmox-infra-ix01
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 10.208.64.12
  }
  node {
    name: avi-proxmox-infra-ix02
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.184.65.141
  }
  node {
    name: avi-proxmox-infra-ix03
    nodeid: 26
    quorum_votes: 1
    ring0_addr: 10.208.64.13
  }
  node {
    name: avi-proxmox-infra-ix04
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.64.65.151
  }
  node {
    name: avi-proxmox-infra-ix05
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.64.65.150
  }
  node {
    name: avi-proxmox-infra-ix06
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.64.65.148
  }
  node {
    name: avi-proxmox-infra-ix07
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 10.72.65.163
  }
  node {
    name: avi-proxmox-infra-ix08
    nodeid: 25
    quorum_votes: 1
    ring0_addr: 10.192.66.70
  }
  node {
    name: avi-proxmox-infra-ix09
    nodeid: 29
    quorum_votes: 1
    ring0_addr: 10.184.65.7
  }
  node {
    name: avi-proxmox-infra-ix10
    nodeid: 34
    quorum_votes: 1
    ring0_addr: 10.208.30.178
  }
  node {
    name: avi-proxmox-infra-ix11
    nodeid: 35
    quorum_votes: 1
    ring0_addr: 10.208.21.82
  }
  node {
    name: avi-proxmox-infra-sd01
    nodeid: 11
    quorum_votes: 1
    ring0_addr: 10.152.65.170
  }
  node {
    name: avi-proxmox-infra-sd02
    nodeid: 12
    quorum_votes: 1
    ring0_addr: 10.152.65.169
  }
  node {
    name: avi-proxmox-infra-sd03
    nodeid: 13
    quorum_votes: 1
    ring0_addr: 10.152.65.168
  }
  node {
    name: avi-proxmox-infra-sd04
    nodeid: 19
    quorum_votes: 1
    ring0_addr: 10.168.64.239
  }
  node {
    name: avi-proxmox-infra-sd05
    nodeid: 18
    quorum_votes: 1
    ring0_addr: 10.168.64.240
  }
  node {
    name: avi-proxmox-infra-sd06
    nodeid: 17
    quorum_votes: 1
    ring0_addr: 10.168.64.241
  }
  node {
    name: avi-proxmox-infra-sd07
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.168.65.221
  }
  node {
    name: avi-proxmox-infra-sd08
    nodeid: 22
    quorum_votes: 1
    ring0_addr: 10.168.64.138
  }
  node {
    name: avi-proxmox-infra-sd09
    nodeid: 23
    quorum_votes: 1
    ring0_addr: 10.136.64.164
  }
  node {
    name: avi-proxmox-infra-sd10
    nodeid: 28
    quorum_votes: 1
    ring0_addr: 10.160.65.146
  }
  node {
    name: avi-proxmox-infra-sd11
    nodeid: 27
    quorum_votes: 1
    ring0_addr: 10.160.65.147
  }
  node {
    name: avi-proxmox-infra-sd12
    nodeid: 31
    quorum_votes: 1
    ring0_addr: 10.144.65.105
  }
  node {
    name: avi-proxmox-infra-sd13
    nodeid: 32
    quorum_votes: 1
    ring0_addr: 10.144.39.74
  }
  node {
    name: avi-proxmox-infra-sd14
    nodeid: 33
    quorum_votes: 1
    ring0_addr: 10.144.38.166
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: infra01
  config_version: 122
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

the network configuration is quite large due to the number of vmbr bridges, so I will only attach the beginning and the end

Code:
auto lo
iface lo inet loopback

iface eno1np0 inet manual

iface eno2np1 inet manual

iface eno3 inet manual

iface eno4 inet manual

auto vmbr1
iface vmbr1 inet static
    address 10.0.0.200/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

    up ip link add veth31 type veth && brctl addif vmbr1 veth31 && brctl delif vmbr1 veth31 && ip link delete veth31

auto vmbr2
iface vmbr2 inet static
    address 10.0.0.174/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

    up ip link add veth32 type veth && brctl addif vmbr2 veth32 && brctl delif vmbr2 veth32 && ip link delete veth32

auto vmbr3
iface vmbr3 inet static
    address 10.0.0.168/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

    up ip link add veth33 type veth && brctl addif vmbr3 veth33 && brctl delif vmbr3 veth33 && ip link delete veth33


auto vmbr172
iface vmbr172 inet static
    address 10.0.15.110/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

auto vmbr0
iface vmbr0 inet static
    address 10.0.0.28/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

auto vmbr204
iface vmbr204 inet static
    address 10.0.15.112/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

auto vmbr206
iface vmbr206 inet static
    address 10.0.15.114/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

auto vmbr208
iface vmbr208 inet static
    address 10.0.15.116/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

auto vmbr261
iface vmbr261 inet static
    address 10.0.15.118/31
    bridge-ports none
    bridge-stp off
    bridge-fd 0

ip -4 a
Code:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.208.64.12/32 brd 10.208.64.12 scope global lo:1
       valid_lft forever preferred_lft forever
    inet 169.254.100.100/32 brd 169.254.100.100 scope global lo:100
       valid_lft forever preferred_lft forever
258: vmbr101: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 10.0.2.212/31 scope global vmbr101
       valid_lft forever preferred_lft forever
2: eno1np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    altname enp25s0f0np0
    inet 10.208.3.54/30 scope global eno1np0
       valid_lft forever preferred_lft forever
4: eno2np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    altname enp25s0f1np1
    inet 10.208.2.54/30 scope global eno2np1
       valid_lft forever preferred_lft forever
773: vmbr206: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 10.0.15.114/31 scope global vmbr206
       valid_lft forever preferred_lft forever
261: vmbr102: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 10.0.2.214/31 scope global vmbr102
       valid_lft forever preferred_lft forever
6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 10.0.0.200/31 scope global vmbr1
       valid_lft forever preferred_lft forever
264: vmbr103: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 10.0.2.216/31 scope global vmbr103
       valid_lft forever preferred_lft forever

etc...

Regarding the logs from the other nodes in the larger part of the cluster: do you need the full log of the start of the event and ±20 minutes around it?
 
a little bit before and a little bit after. most important would be corosync, pve-cluster, pve-ha-lrm, pve-ha-crm, watchdog-mux and messages from the kernel
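
for example, something along these lines should cover it (the timestamps are only placeholders for the window in question):
Code:
# relevant services for the incident window (adjust --since/--until)
journalctl --since "2025-03-27 23:40" --until "2025-03-28 00:15" \
    -u corosync -u pve-cluster -u pve-ha-lrm -u pve-ha-crm -u watchdog-mux > ha-units.log
# kernel messages for the same window
journalctl -k --since "2025-03-27 23:40" --until "2025-03-28 00:15" > kernel.log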
 
Unfortunately there are no logs left in journalctl, but here is what I could find in the syslog on a hypervisor from the larger part of the cluster.
* The corosync and kernel logs are attached as files because they are large.

pve-cluster:
Code:
/var/log/syslog:Mar 27 23:55:10 avi-proxmox-infra-ix03 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
/var/log/syslog:Mar 27 23:55:11 avi-proxmox-infra-ix03 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.

pve-ha-lrm:
Code:
/var/log/syslog:Mar 27 23:57:07 avi-proxmox-infra-ix03 pve-ha-lrm[16416]: successfully acquired lock 'ha_agent_avi-proxmox-infra-ix03_lock'
/var/log/syslog:Mar 27 23:57:07 avi-proxmox-infra-ix03 pve-ha-lrm[16416]: watchdog active
/var/log/syslog:Mar 27 23:57:07 avi-proxmox-infra-ix03 pve-ha-lrm[16416]: status change wait_for_agent_lock => active
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27859]: starting service vm:102
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27860]: starting service vm:127
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27861]: starting service vm:135
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27862]: starting service vm:153
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27863]: start VM 102: UPID:avi-proxmox-infra-ix03:00006CD7:00003A6D:67E5BBA4:qmstart:102:root@pam:
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27859]: <root@pam> starting task UPID:avi-proxmox-infra-ix03:00006CD7:00003A6D:67E5BBA4:qmstart:102:root@pam:
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27864]: start VM 153: UPID:avi-proxmox-infra-ix03:00006CD8:00003A6D:67E5BBA4:qmstart:153:root@pam:
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27866]: start VM 127: UPID:avi-proxmox-infra-ix03:00006CDA:00003A6D:67E5BBA4:qmstart:127:root@pam:
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27862]: <root@pam> starting task UPID:avi-proxmox-infra-ix03:00006CD8:00003A6D:67E5BBA4:qmstart:153:root@pam:
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27860]: <root@pam> starting task UPID:avi-proxmox-infra-ix03:00006CDA:00003A6D:67E5BBA4:qmstart:127:root@pam:
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27865]: start VM 135: UPID:avi-proxmox-infra-ix03:00006CD9:00003A6D:67E5BBA4:qmstart:135:root@pam:
/var/log/syslog:Mar 27 23:57:08 avi-proxmox-infra-ix03 pve-ha-lrm[27861]: <root@pam> starting task UPID:avi-proxmox-infra-ix03:00006CD9:00003A6D:67E5BBA4:qmstart:135:root@pam:
/var/log/syslog:Mar 27 23:57:12 avi-proxmox-infra-ix03 pve-ha-lrm[27860]: <root@pam> end task UPID:avi-proxmox-infra-ix03:00006CDA:00003A6D:67E5BBA4:qmstart:127:root@pam: OK
/var/log/syslog:Mar 27 23:57:12 avi-proxmox-infra-ix03 pve-ha-lrm[27860]: service status vm:127 started
/var/log/syslog:Mar 27 23:57:12 avi-proxmox-infra-ix03 pve-ha-lrm[30275]: starting service vm:190
/var/log/syslog:Mar 27 23:57:12 avi-proxmox-infra-ix03 pve-ha-lrm[30278]: start VM 190: UPID:avi-proxmox-infra-ix03:00007646:00003C48:67E5BBA8:qmstart:190:root@pam:
/var/log/syslog:Mar 27 23:57:12 avi-proxmox-infra-ix03 pve-ha-lrm[30275]: <root@pam> starting task UPID:avi-proxmox-infra-ix03:00007646:00003C48:67E5BBA8:qmstart:190:root@pam:
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27859]: Task 'UPID:avi-proxmox-infra-ix03:00006CD7:00003A6D:67E5BBA4:qmstart:102:root@pam:' still active, waiting
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27862]: Task 'UPID:avi-proxmox-infra-ix03:00006CD8:00003A6D:67E5BBA4:qmstart:153:root@pam:' still active, waiting
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27861]: Task 'UPID:avi-proxmox-infra-ix03:00006CD9:00003A6D:67E5BBA4:qmstart:135:root@pam:' still active, waiting
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27859]: <root@pam> end task UPID:avi-proxmox-infra-ix03:00006CD7:00003A6D:67E5BBA4:qmstart:102:root@pam: OK
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27861]: <root@pam> end task UPID:avi-proxmox-infra-ix03:00006CD9:00003A6D:67E5BBA4:qmstart:135:root@pam: OK
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27859]: service status vm:102 started
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27861]: service status vm:135 started
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[30968]: starting service vm:258
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[30973]: start VM 258: UPID:avi-proxmox-infra-ix03:000078FD:00003CAA:67E5BBA9:qmstart:258:root@pam:
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[30968]: <root@pam> starting task UPID:avi-proxmox-infra-ix03:000078FD:00003CAA:67E5BBA9:qmstart:258:root@pam:
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27862]: <root@pam> end task UPID:avi-proxmox-infra-ix03:00006CD8:00003A6D:67E5BBA4:qmstart:153:root@pam: OK
/var/log/syslog:Mar 27 23:57:13 avi-proxmox-infra-ix03 pve-ha-lrm[27862]: service status vm:153 started
/var/log/syslog:Mar 27 23:57:14 avi-proxmox-infra-ix03 pve-ha-lrm[30275]: <root@pam> end task UPID:avi-proxmox-infra-ix03:00007646:00003C48:67E5BBA8:qmstart:190:root@pam: OK
/var/log/syslog:Mar 27 23:57:14 avi-proxmox-infra-ix03 pve-ha-lrm[30275]: service status vm:190 started
/var/log/syslog:Mar 27 23:57:17 avi-proxmox-infra-ix03 pve-ha-lrm[30968]: <root@pam> end task UPID:avi-proxmox-infra-ix03:000078FD:00003CAA:67E5BBA9:qmstart:258:root@pam: OK
/var/log/syslog:Mar 27 23:57:17 avi-proxmox-infra-ix03 pve-ha-lrm[30968]: service status vm:258 started

pve-ha-crm:
Code:
/var/log/syslog:Mar 27 23:51:17 avi-proxmox-infra-ix03 pve-ha-crm[11490]: loop take too long (58 seconds)
/var/log/syslog:Mar 27 23:55:12 avi-proxmox-infra-ix03 systemd[1]: Starting pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
/var/log/syslog:Mar 27 23:55:13 avi-proxmox-infra-ix03 pve-ha-crm[11231]: starting server
/var/log/syslog:Mar 27 23:55:13 avi-proxmox-infra-ix03 pve-ha-crm[11231]: status change startup => wait_for_quorum
/var/log/syslog:Mar 27 23:55:13 avi-proxmox-infra-ix03 systemd[1]: Started pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
/var/log/syslog:Mar 27 23:55:43 avi-proxmox-infra-ix03 pve-ha-crm[11231]: status change wait_for_quorum => slave

watchdog-mux:
Code:
/var/log/syslog:Mar 27 23:51:13 avi-proxmox-infra-ix03 watchdog-mux[1671]: client watchdog expired - disable watchdog updates
/var/log/syslog:Mar 27 23:51:17 avi-proxmox-infra-ix03 watchdog-mux[1671]: exit watchdog-mux with active connections
/var/log/syslog:Mar 27 23:51:17 avi-proxmox-infra-ix03 systemd[1]: watchdog-mux.service: Deactivated successfully.
/var/log/syslog:Mar 27 23:51:17 avi-proxmox-infra-ix03 systemd[1]: watchdog-mux.service: Consumed 2min 25.125s CPU time.
/var/log/syslog:Mar 27 23:54:44 avi-proxmox-infra-ix03 systemd[1]: Started watchdog-mux.service - Proxmox VE watchdog multiplexer.
/var/log/syslog:Mar 27 23:54:44 avi-proxmox-infra-ix03 watchdog-mux[1746]: Watchdog driver 'Software Watchdog', version 0
 

Attachments

those logs unfortunately don't really contain any actionable details..
 
it only shows the first set of nodes going down, but nothing about the time period afterwards until the fencing.. the journal should normally contain those lines (corosync trying to re-establish quorum, tokens being lost, and so on)
 
I found the most detailed log of what was happening on the smaller group of hypervisors that lost connection to the main group, as well as the system logs from the main group; I attach them as files. They contain messages about corosync's attempts to restore the cluster.
 

Attachments

unfortunately that doesn't really tell us much more..

looking at the quorate partition:

23:50:20: links start going down
23:50:31: last link goes down
23:50:39: token not being received for 20s is logged
23:51:13: watchdog expired
23:51:16: membership was formed
23:51:17: synced up
fenced

some things which are peculiar/of interest:
- your cluster is very big (37 nodes)
- the token timeout is quite large as a result (3000 + (37-2) x 650 = 25750ms = ~26s), which means the cluster only has two chances of reestablishing a membership before being fenced (the HA watchdog is set to 60s)
- it takes a long time for links to actually go down (corosync will send four heartbeat pings per token interval, which means it only sends one every 6.4s and waits almost 13s for a response)
- your network setup looks peculiar (single corosync link bound to lo - could you give more details about that?)

you might want to consider:
- adding a second link and ensuring the links are not saturated by other traffic causing latency spikes
- tuning knet_ping_interval and knet_ping_timeout to detect links going down or up again faster (but don't go too low or you will have frequent link flappings)
- tuning token and/or token_coefficient to allow re-establishing memberships quicker (but don't go too low, else you risk affecting regular operations as well); a quick way to sanity-check candidate values is sketched below
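
for example, a rough back-of-the-envelope check of candidate values against the 60s HA watchdog (plain shell, numbers taken from the calculation above):
Code:
# effective token timeout = token + (nodes - 2) * token_coefficient
nodes=37
token=3000            # base token used in the calculation above (ms)
coefficient=650       # token_coefficient (ms)
watchdog=60000        # HA watchdog (ms)
timeout=$(( token + (nodes - 2) * coefficient ))
echo "token timeout: ${timeout} ms"                                   # 25750
echo "membership attempts before fencing: $(( watchdog / timeout ))"  # 2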

you probably want to disarm HA while playing around with corosync settings to avoid another fence event.
 
Thanks for the detailed interpretation of the situation from your side!

In our network configuration, in addition to the main address, which is bound to the server loopback, there are two emergency interfaces ("legs") connected to each server. This is one of the main ways to provide redundant access to an address over two routed links when dynamic routing is in place.

I think a good idea would be to add a second link on one of these interfaces, since the two emergency interfaces usually carry very little traffic (a rough sketch of that change is included below, after the timing settings).
I will also change the corosync configuration by adding:
Code:
# speed up detection of failed links/nodes
knet_ping_interval: 100
knet_ping_timeout: 500
# shorten membership re-formation: 1000 + (37-2) x 250 = 9750 ms per attempt, i.e. ~6 attempts before HA fencing
token: 1000
token_coefficient: 250
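
For the second link, I expect the corosync.conf change to look roughly like this (just a sketch: the ring1_addr values are per-node placeholders for the emergency interface addresses, and config_version has to be bumped):
Code:
nodelist {
  node {
    name: avi-proxmox-infra-ix01
    nodeid: 20
    quorum_votes: 1
    ring0_addr: 10.208.64.12
    # placeholder - the emergency interface address of this node
    ring1_addr: <second address>
  }
  # ... ring1_addr added to every other node entry as well
}

totem {
  # existing settings stay as they are; only a second interface block is added
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}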

In the near future we are planning another outage of one of the three data centers, and then I will be able to provide a more detailed log of what happens with the new settings.
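
To make sure the journal entries survive this time, journald can also be kept persistent with a larger size limit, e.g.:
Code:
# /etc/systemd/journald.conf (or a drop-in in /etc/systemd/journald.conf.d/)
[Journal]
Storage=persistent
SystemMaxUse=1G

# apply with: systemctl restart systemd-journald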

I also wanted to ask: is there a limit to the number of hypervisors in a cluster (the average data transfer rate between nodes is ~9.03 Gbit/s), and would it perhaps be best to split such a large cluster into two smaller ones?
 
it would probably be better to have a single, smaller cluster per datacenter, and then use the remote migrate feature or PDM to handle transferring guests from one cluster to the others if needed.
 