40-node prod cluster restarts when joining or removing a node.

This means that once a node goes offline, forming a new membership will take ~1min24s (one full token timeout plus one full consensus timeout), which is too long when HA is active and will very likely lead to fencing, see [1]. My suggestion would be to remove all current custom corosync configuration and instead set only a custom token coefficient as described in [1].
I haven't tested how corosync with a customized config like yours would react to a config change, so it might be advisable to disarm HA before making the change (and re-enable it afterwards).

[1] https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/page-2#post-825826

We reverted our settings as per your suggestion and applied only the configuration below. Note that HA was not disabled when these settings were applied (JFYI).

Code:
totem {
  cluster_name: proxmox-prod
  config_version: 80
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token_coefficient: 200
  version: 2
}
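
For reference, here is a rough sketch of how the token and consensus timeouts should scale with token_coefficient on a 40-node cluster. It assumes the formula and defaults I understand from corosync.conf(5) for corosync 3.x (token 3000 ms, token_coefficient 650 ms, consensus 1.2 * token), and the helper name is made up for illustration. It does not reproduce the ~1min24s figure above, since that came from the earlier custom config, so treat the numbers as estimates and compare them with what corosync reports at runtime.

```python
# Estimate corosync membership timeouts (all values in milliseconds).
# Assumed defaults, taken from corosync.conf(5) for corosync 3.x:
#   token = 3000, token_coefficient = 650, consensus = 1.2 * token
def membership_timeouts_ms(nodes, token_ms=3000, token_coefficient_ms=650):
    """Return (token, consensus, token + consensus) for a cluster size."""
    # The coefficient only applies when the nodelist has at least 3 nodes.
    extra_nodes = max(nodes - 2, 0)
    real_token = token_ms + extra_nodes * token_coefficient_ms
    # Integer form of 1.2 * token, to avoid float rounding noise.
    consensus = real_token * 12 // 10
    return real_token, consensus, real_token + consensus


for coefficient in (650, 200):
    token, consensus, total = membership_timeouts_ms(
        40, token_coefficient_ms=coefficient
    )
    print(
        f"token_coefficient={coefficient}: token={token} ms, "
        f"consensus={consensus} ms, total={total / 1000:.1f} s"
    )
```

Under these assumptions, token_coefficient: 200 gives roughly 23 s for 40 nodes (token plus consensus), versus roughly 61 s with the assumed default coefficient of 650.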

Our findings from testing: we deleted a node from the cluster at around Jan 15 11:48:22 (see the attached logs). The cluster did not lose quorum. Thank you!

Based on this, do you still recommend creating a separate network for corosync? The only feasible way we see is to create another bond for corosync with LACP (fast) everywhere on the existing switches, which are in a VLT.
 
