Single ring failure causes cluster reboot? (AKA: We hates the fencing my precious.. we hates it..)

dlasher

Someone please explain to me why the loss of a single ring should force the entire cluster (9 hosts) to reboot?

Topology - isn't 4 rings enough??
  1. ring0_addr: 10.4.5.0/24 -- eth0/bond0 - switch1 (1ge)
  2. ring1_addr: 198.18.50.0/24 -- eth1/bond1 - switch2 (1ge)
  3. ring2_addr: 198.18.51.0/24 -- eth11&12/bond11 - switch3 & switch4 (2x10ge)
  4. ring3_addr: 198.18.53.0/24 -- eth11&12/bond11 - switch3 & switch4 (2x10ge)

I did a code upgrade on switch1, and the entire cluster rebooted.

Code:
23:39:20 up 16 min, 1 user, load average: 0.93, 1.49, 1.50
23:39:20 up 17 min, 1 user, load average: 2.90, 1.82, 1.68
23:39:21 up 17 min, 1 user, load average: 3.10, 3.02, 2.71
23:39:21 up 17 min, 1 user, load average: 7.52, 6.50, 4.41
23:39:21 up 17 min, 1 user, load average: 1.46, 1.39, 1.69
(you get the idea)

Shouldn't I be able to power off switch1 and switch2, and the cluster still stay in sync? Until ring3 is gone too, we should be happy, yes?
 
please provide the logs of corosync and pve-cluster up to the fence event (e.g., journalctl -u corosync -u pve-cluster --since "....") and your corosync.conf
 
please provide the logs of corosync and pve-cluster up to the fence event (e.g., journalctl -u corosync -u pve-cluster --since "....") and your corosync.conf
Will dig them out tonight, thanks.
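
For my own reference, I assume something like this will capture the right window once I pin down the exact time (the timestamps below are just placeholders to adjust):

Code:
# example only - adjust the window to bracket the fence event
journalctl -u corosync -u pve-cluster --since "2022-10-03 23:00" --until "2022-10-03 23:45" > corosync-pve-cluster.log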

From an operational standpoint, is there any way to tweak the behavior of fencing?

For example, this cluster has CEPH, and as long as CEPH is happy, I'm fine with all the VM's being shut down, but by all means, don't ()*@#$ reboot!!! It's easily 20 minutes from the time the cluster panics, until all the nodes are back up, ceph is happy, the VM's are restarted, etc, so it's incredibly disruptive. (Not to mention that while CEPHFS is getting happy, HA gets ticked off, and ends up leaving some VM's not started)
 
Here's logs from a (7) node cluster, this is from node 1 - I notice that there's nothing in the logs that explicitly says "hey, we've failed, I'm rebooting", so I hope this makes sense to you @fabian.

I read this as "lost 0, lost 1, 2 is fine, we shuffle a bit to make 2 happy, then pull the ripcord"

Code:
Oct 03 23:17:58 pmx1 corosync[7047]:   [TOTEM ] Token has not been received in 4687 ms
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 1 link: 0 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 1 link: 1 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 7 link: 0 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 7 link: 1 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 2 link: 0 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 2 link: 1 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 3 link: 0 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 3 link: 1 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 4 link: 0 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 4 link: 1 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 5 link: 0 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] link: host: 5 link: 1 is down
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 1 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 1 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 7 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 7 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 2 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 2 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 3 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 3 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 4 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 4 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 5 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx1 corosync[7047]:   [KNET  ] host: host: 5 (passive) best link: 2 (pri: 1)
Oct 03 23:18:15 pmx1 corosync[7047]:   [QUORUM] Sync members[3]: 1 6 7
Oct 03 23:18:15 pmx1 corosync[7047]:   [QUORUM] Sync left[4]: 2 3 4 5
Oct 03 23:18:15 pmx1 corosync[7047]:   [TOTEM ] A new membership (1.1235) was formed. Members left: 2 3 4 5
Oct 03 23:18:15 pmx1 corosync[7047]:   [TOTEM ] Failed to receive the leave message. failed: 2 3 4 5
Oct 03 23:18:20 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 10
Oct 03 23:18:21 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 20
Oct 03 23:18:22 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 30
Oct 03 23:18:22 pmx1 corosync[7047]:   [QUORUM] Sync members[3]: 1 6 7
Oct 03 23:18:22 pmx1 corosync[7047]:   [QUORUM] Sync left[4]: 2 3 4 5
Oct 03 23:18:22 pmx1 corosync[7047]:   [TOTEM ] A new membership (1.1239) was formed. Members
Oct 03 23:18:23 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 40
Oct 03 23:18:24 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 50
Oct 03 23:18:25 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 60
Oct 03 23:18:26 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 70
Oct 03 23:18:27 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 80
Oct 03 23:18:28 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 90
Oct 03 23:18:29 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 100
Oct 03 23:18:29 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retried 100 times
Oct 03 23:18:29 pmx1 pmxcfs[6759]: [status] crit: cpg_send_message failed: 6
Oct 03 23:18:30 pmx1 corosync[7047]:   [QUORUM] Sync members[3]: 1 6 7
Oct 03 23:18:30 pmx1 corosync[7047]:   [QUORUM] Sync left[4]: 2 3 4 5
Oct 03 23:18:30 pmx1 corosync[7047]:   [TOTEM ] A new membership (1.123d) was formed. Members
Oct 03 23:18:30 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 10
Oct 03 23:18:31 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 20
Oct 03 23:18:32 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 30
Oct 03 23:18:33 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 40
Oct 03 23:18:34 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 50
Oct 03 23:18:35 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 60
Oct 03 23:18:36 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 70
Oct 03 23:18:37 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 80
Oct 03 23:18:37 pmx1 corosync[7047]:   [QUORUM] Sync members[3]: 1 6 7
Oct 03 23:18:37 pmx1 corosync[7047]:   [QUORUM] Sync left[4]: 2 3 4 5
Oct 03 23:18:37 pmx1 corosync[7047]:   [TOTEM ] A new membership (1.1241) was formed. Members
Oct 03 23:18:38 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 90
Oct 03 23:18:39 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 100
Oct 03 23:18:39 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retried 100 times
Oct 03 23:18:39 pmx1 pmxcfs[6759]: [status] crit: cpg_send_message failed: 6
Oct 03 23:18:40 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 10
Oct 03 23:18:41 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 20
Oct 03 23:18:42 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 30
Oct 03 23:18:43 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 40
Oct 03 23:18:44 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 50
Oct 03 23:18:45 pmx1 corosync[7047]:   [QUORUM] Sync members[3]: 1 6 7
Oct 03 23:18:45 pmx1 corosync[7047]:   [QUORUM] Sync left[4]: 2 3 4 5
Oct 03 23:18:45 pmx1 corosync[7047]:   [TOTEM ] A new membership (1.1245) was formed. Members
Oct 03 23:18:45 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 60
Oct 03 23:18:46 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 70
Oct 03 23:18:47 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 80
Oct 03 23:18:48 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 90
Oct 03 23:18:49 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 100
Oct 03 23:18:49 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retried 100 times
Oct 03 23:18:49 pmx1 pmxcfs[6759]: [status] crit: cpg_send_message failed: 6
Oct 03 23:18:50 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 10
Oct 03 23:18:51 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 20
Oct 03 23:18:52 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 30
Oct 03 23:18:53 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 40
Oct 03 23:18:54 pmx1 corosync[7047]:   [QUORUM] Sync members[3]: 1 6 7
Oct 03 23:18:54 pmx1 corosync[7047]:   [QUORUM] Sync left[4]: 2 3 4 5
Oct 03 23:18:54 pmx1 corosync[7047]:   [TOTEM ] A new membership (1.1249) was formed. Members
Oct 03 23:18:54 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 50
Oct 03 23:18:55 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 60
Oct 03 23:18:56 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 70
Oct 03 23:18:57 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 80
Oct 03 23:18:58 pmx1 pmxcfs[6759]: [status] notice: cpg_send_message retry 90
-- Boot 9262569c78f646439a98b6b219627c44 --

^^^ no "i'm rebooting" message - just "hey, welcome back!"
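
My guess is the watchdog hard-resets the box before anything can be flushed to the journal. The closest I could get is pulling the HA services and watchdog-mux from the previous boot:

Code:
# previous boot (-b -1), HA + watchdog units only
journalctl -b -1 -u watchdog-mux -u pve-ha-crm -u pve-ha-lrm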
 
And the conf file.

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pmx1
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 10.4.5.101
    ring1_addr: 198.18.50.101
    ring2_addr: 198.18.51.101
    ring3_addr: 198.18.53.101
  }
  node {
    name: pmx2
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 10.4.5.102
    ring1_addr: 198.18.50.102
    ring2_addr: 198.18.52.102
    ring3_addr: 198.18.53.102
  }
  node {
    name: pmx3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.4.5.103
    ring1_addr: 198.18.50.103
    ring2_addr: 198.18.51.103
    ring3_addr: 198.18.53.103
  }
  node {
    name: pmx4
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.4.5.104
    ring1_addr: 198.18.50.104
    ring2_addr: 198.18.51.104
    ring3_addr: 198.18.53.104
  }
  node {
    name: pmx5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.4.5.105
    ring1_addr: 198.18.50.105
    ring2_addr: 198.18.51.105
    ring3_addr: 198.18.53.105
  }
  node {
    name: pmx6
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 10.4.5.106
    ring1_addr: 198.18.50.106
    ring2_addr: 198.18.51.106
    ring3_addr: 198.18.53.106
  }
  node {
    name: pmx7
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.4.5.107
    ring1_addr: 198.18.50.107
    ring2_addr: 198.18.51.107
    ring3_addr: 198.18.53.107
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pmx7home
  config_version: 19
  interface {
    linknumber: 0
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
  }
  interface {
    linknumber: 1
    knet_ping_interval: 200
    knet_ping_timeout: 5000
    knet_pong_count: 1
  }
  interface {
    linknumber: 2
    knet_ping_interval: 200
    knet_ping_timeout: 10000
    knet_pong_count: 1
  }
  interface {
    linknumber: 3
    knet_ping_interval: 200
    knet_ping_timeout: 10000
    knet_pong_count: 1
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
 
For reference, from a topology standpoint, pmx1/2/3/4/5 (nodes 6,5,4,3,2) sit in the same rack, whereas pmx6/7 (nodes 1,7) sit in another room, connected to different switches with shared infra between.

Code:
root@pmx1:~# pveversion
pve-manager/7.2-11/b76d3178 (running kernel: 5.15.39-3-pve)
 
Here's what the same event looked like from pmx4 (node 3)

Code:
Oct 03 23:17:58 pmx4 corosync[6951]:   [TOTEM ] Token has not been received in 4687 ms
Oct 03 23:17:58 pmx4 corosync[6951]:   [KNET  ] link: host: 6 link: 0 is down
Oct 03 23:17:58 pmx4 corosync[6951]:   [KNET  ] link: host: 6 link: 1 is down
Oct 03 23:17:58 pmx4 corosync[6951]:   [KNET  ] host: host: 6 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx4 corosync[6951]:   [KNET  ] host: host: 6 (passive) best link: 2 (pri: 1)
Oct 03 23:17:58 pmx4 corosync[6951]:   [KNET  ] link: host: 5 link: 1 is down
Oct 03 23:17:58 pmx4 corosync[6951]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 4 link: 1 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 5 link: 0 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 1 link: 1 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 7 link: 1 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 2 link: 1 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 4 link: 0 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 1 link: 0 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 7 link: 0 is down
Oct 03 23:17:59 pmx4 corosync[6951]:   [KNET  ] link: host: 2 link: 0 is down
Oct 03 23:18:00 pmx4 corosync[6951]:   [TOTEM ] A processor failed, forming new configuration: token timed out (6250ms), waiting 7500ms for consensus.
Oct 03 23:18:07 pmx4 corosync[6951]:   [QUORUM] Sync members[1]: 3
Oct 03 23:18:07 pmx4 corosync[6951]:   [QUORUM] Sync left[6]: 1 2 4 5 6 7
Oct 03 23:18:07 pmx4 corosync[6951]:   [TOTEM ] A new membership (3.1235) was formed. Members left: 1 2 4 5 6 7
Oct 03 23:18:07 pmx4 corosync[6951]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 4 5 6 7
Oct 03 23:18:07 pmx4 pmxcfs[6750]: [dcdb] notice: members: 3/6750
Oct 03 23:18:07 pmx4 pmxcfs[6750]: [status] notice: members: 3/6750
Oct 03 23:18:07 pmx4 corosync[6951]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 03 23:18:07 pmx4 corosync[6951]:   [QUORUM] Members[1]: 3
Oct 03 23:18:07 pmx4 corosync[6951]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 03 23:18:07 pmx4 pmxcfs[6750]: [status] notice: node lost quorum
Oct 03 23:18:07 pmx4 pmxcfs[6750]: [dcdb] crit: received write while not quorate - trigger resync
Oct 03 23:18:07 pmx4 pmxcfs[6750]: [dcdb] crit: leaving CPG group
Oct 03 23:18:08 pmx4 pmxcfs[6750]: [dcdb] notice: start cluster connection
Oct 03 23:18:08 pmx4 pmxcfs[6750]: [dcdb] crit: cpg_join failed: 14
Oct 03 23:18:08 pmx4 pmxcfs[6750]: [dcdb] crit: can't initialize service
Oct 03 23:18:14 pmx4 pmxcfs[6750]: [dcdb] notice: members: 3/6750
Oct 03 23:18:14 pmx4 pmxcfs[6750]: [dcdb] notice: all data is up to date
Oct 03 23:18:22 pmx4 corosync[6951]:   [QUORUM] Sync members[1]: 3
Oct 03 23:18:22 pmx4 corosync[6951]:   [TOTEM ] A new membership (3.1239) was formed. Members
Oct 03 23:18:22 pmx4 corosync[6951]:   [QUORUM] Members[1]: 3
Oct 03 23:18:22 pmx4 corosync[6951]:   [MAIN  ] Completed service synchronization, ready to provide service.
-- Boot fd66aa8d6be847dda47eff40d279381c --

There is zero chance ring2/ring3 were unavailable due to network conditions. Each server has dual 10GE, which goes to dual Arista 10GE switches running MLAG. (Used to be another vendor; I replaced the switches thinking they were part of the problem - they weren't.)
 
pmx2 ring2_addr: 198.18.52.102

that one sits in a different subnet - is that intentional? are you sure ring2 was ever operational?

also you have heavily modified timeouts in your corosync.conf that can interfere with link down and up detection..

please provide the logs from *all* nodes starting a few minutes earlier.
 
pmx2 ring2_addr: 198.18.52.102

that one sits in a different subnet - is that intentional? are you sure ring2 was ever operational?

also you have heavily modified timeouts in your corosync.conf that can interfere with link down and up detection..

please provide the logs from *all* nodes starting a few minutes earlier.

RE: pmx2 - Good catch - no, that wasn't intentional, fixing it already. From a network standpoint, 198.18.50-53.xxx can all ping each other, so yes, the network pieces were all operational. Based on the config, however, it looks like pmx2 wasn't on ring2 correctly. That in and of itself shouldn't have broken things, since it still had ring3, right?

RE: Timeouts - those were added based on threads on the forums, as a last-ditch effort. I've been fighting this problem a LONG time across multiple versions with no end in sight. The timeouts didn't make things worse, as far as I can tell, but I can remove them.
 
your timeout settings definitely make the situation worse - corosync expects links to go down in case of an outage before the totem timeout is reached, but in your case those timeouts are inverted.

with 7 nodes the default token timeout is 3000 + 650 * (7 - 2) = 6250ms, but your links 2 and 3 take at least 10s to be marked as down. also, your network shouldn't require a ping timeout that is this high (the ping/heartbeat is just a simple UDP packet to see if the other node is reachable!). the default settings in your case would be derived from the token timeout, with an interval (how often the ping is sent) of 6250/4 = 1562.5ms, a timeout (if the pong doesn't arrive within this time the link is dead) of 6250/2 = 3125ms, and two successful ping/pongs to mark a dead link as up again. if your links are flapping with a timeout of ~3s then you are using NICs which are not low-latency enough for corosync purposes.
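
spelled out for your 7 nodes (rounding down where the real values have fractions), roughly:

Code:
nodes=7
token=$(( 3000 + 650 * (nodes - 2) ))       # 6250 ms
echo "token timeout: ${token} ms"
echo "ping interval: $(( token / 4 )) ms"   # ~1562 ms - how often the ping is sent
echo "ping timeout:  $(( token / 2 )) ms"   # 3125 ms - link marked down after this
echo "pong count:    2"                     # successful pongs needed to mark a link up again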
 
your timeout settings definitely make the situation worse - corosync expects links to go down in case of an outage before the totem timeout is reached, but in your case those timeouts are inverted.

with 7 nodes the default token timeout is 3000 + 650 * (7 - 2) = 6250ms, but your links 2 and 3 take at least 10s to be marked as down. also, your network shouldn't require a ping timeout that is this high (the ping/heartbeat is just a simple UDP packet to see if the other node is reachable!). the default settings in your case would be derived from the token timeout, with an interval (how often the ping is sent) of 6250/4 = 1562.5ms, a timeout (if the pong doesn't arrive within this time the link is dead) of 6250/2 = 3125ms, and two successful ping/pongs to mark a dead link as up again. if your links are flapping with a timeout of ~3s then you are using NICs which are not low-latency enough for corosync purposes.

That makes sense, thank you - I didn't understand the corosync/totem/cluster-manager interplay. (Is this written up anywhere I can digest?)

I'll drop the timeouts back to default values. Since I know how to cause the meltdown, it will be easy to test the results of the change.
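
If I'm reading that right, the interface sections shrink down to just the link numbers and corosync derives the rest from the token timeout - something like:

Code:
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
  interface {
    linknumber: 2
  }
  interface {
    linknumber: 3
  }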

How would you recommend making the change safely? I suspect there are a couple of processes I should stop, make the change on, and start again, yes?

RE: NICs - all Intel chipsets - I wouldn't suspect they'd be the problem?
Code:
07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
07:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
09:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
09:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
 
From an operational standpoint, is there any way to tweak the behavior of fencing?

For example, this cluster has CEPH, and as long as CEPH is happy, I'm fine with all the VM's being shut down, but by all means, don't ()*@#$ reboot!!! It's easily 20 minutes from the time the cluster panics, until all the nodes are back up, ceph is happy, the VM's are restarted, etc, so it's incredibly disruptive. (Not to mention that while CEPHFS is getting happy, HA gets ticked off, and ends up leaving some VM's not started)

@fabian -- any thoughts on this question? I'd love to have more control over the failure steps/scenarios.
 
with HA fencing is needed to recover resources, there's no way around it. if you don't need the automatic fail-over of HA, you can just not use HA, then no fencing should occur.

in case of a cluster partition, no managing of guests/.. on any non-quorate partition(s) is possible. since you are also using Ceph, what will happen in that case depends on whether the Ceph cluster is also split - if it is, guests won't be able to write and will likely have an outage and/or require a reboot. if Ceph is still working, guests should continue working as well, but they won't be able to be started/stopped/reconfigured/migrated/backed up/..

the corosync values are described in man corosync.conf. our HA stack (which sits on top) is described here: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html , our clustering stack (which sits "between" HA and corosync) here: https://pve.proxmox.com/pve-docs/chapter-pmxcfs.html and here: https://pve.proxmox.com/pve-docs/chapter-pvecm.html (this also contains information on how to edit corosync.conf safely).
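
if I recall the pvecm chapter correctly, editing corosync.conf safely boils down to working on a copy and bumping config_version before moving it into place, e.g.:

Code:
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
# edit the copy and increment config_version in the totem section
nano /etc/pve/corosync.conf.new
# keep a backup, then activate the new config - pmxcfs syncs it and corosync reloads
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf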
 
the corosync values are described in man corosync.conf. our HA stack (which sits on top) is described here: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html , our clustering stack (which sits "between" HA and corosync) here: https://pve.proxmox.com/pve-docs/chapter-pmxcfs.html and here: https://pve.proxmox.com/pve-docs/chapter-pvecm.html (this also contains information on how to edit corosync.conf safely).

Thank you, fantastic information, already used it to clean things up a bit.


if Ceph is still working, guests should continue working as well, but they won't be able to be started/stopped/reconfigured/migrated/backed up/..

Not if PMX thinks we need to reboot. So far, none of the failures have taken down CEPH; it's PMX/HA that gets offended. (Ironic, because corosync/totem has (4) rings and CEPH sits on a single VLAN, but I digress.)

The concept of "reboot makes everything better" is fine if the issue is a software failure, but is the absolute wrong action if there's a hardware failure. You'll reboot, and be in the exact same state. Is there a "fence in place" option - that shuts down all the VM's, throws errors (syslog/email/etc) but doesn't reboot the underlying OS? That would be FAR preferred to troubleshoot whatever issue occurred, while preserving the integrity of the rest of the cluster. And if the failure is temporal, that would be a much better recovery strategy.

I'd like to be able to choose the fence behavior - something other than "reboot". Actually, to be more clear: I *never* want it to reboot upon corosync failure. How do I make that happen?
 
I had a couple of Windows VMs get irreparably corrupted after a spontaneous reboot of all my nodes. Also using Ceph. We ended up disabling HA entirely because of it.

If this is merely a matter of taking some switches offline, and you are already familiar with editing corosync.conf, then you can write a script to add/remove rings before you take down the switches for that ring.

I also see you are not setting a knet_link_priority value...
 
I had a couple of Windows VMs get irreparably corrupted after a spontaneous reboot of all my nodes. Also using Ceph. We ended up disabling HA entirely because of it.
Yeah, feels like only ceph replication has saved me from the heavy hand of rebooting. :(

I also see you are not setting a knet_link_priority value...

What would you suggest?
 
Just more testing. Read the corosync redundancy part about setting ring priorities:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy

Thanks, gave it some thought, and changed the priorities a bit - we'll see if it does better than it has in the past. (Also has me thinking about things that could lower the latency between nodes, like MTU on the ring interfaces)
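
For the record, the shape of the change I made (the priority numbers are just my picks; as I read the docs, in passive mode the highest-priority link that is up wins):

Code:
  interface {
    linknumber: 0
    knet_link_priority: 5
  }
  interface {
    linknumber: 1
    knet_link_priority: 10
  }
  interface {
    linknumber: 2
    knet_link_priority: 20
  }
  interface {
    linknumber: 3
    knet_link_priority: 15
  }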

It would be nice to gather raw data on keepalives across all interfaces for, say, 7 days, and choose the lowest-latency path.
 
with HA fencing is needed to recover resources, there's no way around it. if you don't need the automatic fail-over of HA, you can just not use HA, then no fencing should occur.

Just to make sure I understand this correctly:

If I remove all the HA-configured LXC/KVM settings (I have DNS servers, video recorders, etc.) and make them stand-alone, no-failover configs, it won't fence if corosync gets unhappy? (That doesn't seem to ring true to me in a shared-storage world.)
 
if HA is not active (no resources configured, and the HA services or the node restarted since the last resource was de-configured) the watchdog is not armed and thus no fencing will occur. /etc/pve will go into read-only mode if you lose quorum, to prevent concurrent modifications. you can check the HA status with ha-manager status, but doing a reboot to be on the safe side obviously doesn't hurt.
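
e.g. (the resource ID below is just an example):

Code:
# show quorum/master/LRM state and all configured HA resources
ha-manager status
# de-configure a resource from HA - the guest keeps running, it just becomes unmanaged
ha-manager remove vm:100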
 
