This means that once a node goes offline, forming a new membership will take ~1min24s (one full token timeout plus one full consensus timeout), which is too long when HA is active and will very likely lead to fencing, see [1]. My suggestion would be to remove all current custom corosync configuration and instead set only a custom token coefficient, as described in [1].
I haven't tested how corosync with a customized config like yours would react to a config change, so it might be advisable to disarm HA before making the change (and re-enable it afterwards).
[1] https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/page-2#post-825826
We reverted our settings as per your suggestion and applied only the configuration below. Just FYI, HA was not disabled when these settings were applied.
Code:
totem {
  cluster_name: proxmox-prod
  config_version: 80
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token_coefficient: 200
  version: 2
}
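For anyone following along, here is a rough sketch of what a token coefficient of 200 implies for membership timing. It assumes the formula from corosync.conf(5) (runtime token = token + (nodes - 2) * token_coefficient, applied only with 3 or more nodes) and the corosync 3 defaults of token = 3000 ms and consensus = 1.2 * token. The node counts and the function names are hypothetical and only there for illustration; on a live cluster the actual runtime values should be checked rather than relying on this estimate.

```python
# Sketch only. Assumptions not taken from this thread:
#   - token defaults to 3000 ms (corosync 3) and is not set explicitly
#   - consensus is left at its default of 1.2 * runtime token
#   - node counts below are hypothetical examples


def effective_token_ms(token_ms, coefficient_ms, nodes):
    """Runtime token timeout: token + (nodes - 2) * token_coefficient."""
    if nodes < 3:
        # The coefficient is only applied with at least 3 nodes.
        return token_ms
    return token_ms + (nodes - 2) * coefficient_ms


def consensus_ms(runtime_token_ms):
    """Default consensus timeout: 1.2 * runtime token (integer math)."""
    return runtime_token_ms * 12 // 10


def membership_delay_ms(token_ms, coefficient_ms, nodes):
    """One full token timeout plus one full consensus timeout."""
    token = effective_token_ms(token_ms, coefficient_ms, nodes)
    return token + consensus_ms(token)


if __name__ == "__main__":
    for nodes in (16, 32, 48):
        token = effective_token_ms(3000, 200, nodes)
        total = membership_delay_ms(3000, 200, nodes)
        print(f"{nodes} nodes: token={token} ms, "
              f"token+consensus={total / 1000:.1f} s")
```

The point of the comparison is that the worst-case delay scales with the node count, so lowering the coefficient shortens it without hard-coding a token value.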
Our findings/testing: we deleted a node from the cluster at around Jan 15 11:48:22 (see the attached logs). The cluster did not lose quorum. Thank you!
Based on this, do you still recommend creating a separate network for corosync? The only option we see is to create another bond for corosync with LACP (fast) everywhere on the existing switches, which are in a VLT.