40-node prod cluster restarts when joining or removing a node

aomer786 · Jan 15, 2026

fweber said:
This means that once a node goes offline, forming a new membership will take ~1min24s (one full token and one full consensus timeout), which is too long if HA is active and very likely leads to fencing, see [1]. My suggestion would be to remove all current custom corosync configuration and instead set only a custom token coefficient as described in [1].
I haven't tested how corosync with a customized config like yours would react to a config change, so it might be advisable to disarm HA before making the change (and re-enable it afterwards).

[1] https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/page-2#post-825826

We reverted our settings as you suggested and applied only the configuration below. HA was not disabled while these settings were applied (just FYI).

Code:

totem {
  cluster_name: proxmox-prod
  config_version: 80
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token_coefficient: 200
  version: 2
}
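
As a rough sanity check of what token_coefficient: 200 means for 40 nodes, here is a minimal Python sketch of the formula given in corosync.conf(5): once the nodelist has at least 3 nodes, the runtime token timeout is token + (number_of_nodes - 2) * token_coefficient. It assumes the defaults token = 3000 ms and consensus = 1.2 * token, neither of which is stated in this thread, so check them against the man page of your corosync version. The function names are made up for illustration.

```python
# Sketch only: estimates corosync timeouts from the formula in corosync.conf(5).
# Assumptions (not from this thread): token defaults to 3000 ms, and
# consensus defaults to 1.2 * the runtime token timeout when not set explicitly.

def effective_token_ms(nodes: int, token_ms: int = 3000,
                       token_coefficient_ms: int = 650) -> int:
    """Runtime token timeout in milliseconds for a cluster of `nodes` nodes."""
    if nodes < 3:
        # The coefficient only applies with at least 3 nodes in the nodelist.
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


def default_consensus_ms(token_ms: int) -> int:
    """Assumed default consensus timeout: 1.2 * token, in integer milliseconds."""
    return token_ms * 12 // 10


if __name__ == "__main__":
    token = effective_token_ms(40, token_coefficient_ms=200)
    consensus = default_consensus_ms(token)
    print(f"token={token} ms, consensus={consensus} ms, "
          f"worst case for a new membership={token + consensus} ms")
```

Under these assumptions, 40 nodes with token_coefficient: 200 give a token timeout of 10600 ms and a consensus timeout of 12720 ms, so roughly 23 s for one full token plus one full consensus timeout.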

Our findings from testing: we deleted a node from the cluster at around Jan 15 11:48:22 (see the attached logs). The cluster did not lose quorum. Thank you!

Based on this, do you still recommend creating a separate network for corosync? The only option we see is to create another bond for corosync with LACP (fast) everywhere on the existing switches, which are in a VLT setup.

fweber · Feb 11, 2026

aomer786 said:
We reverted our settings as you suggested and applied only the configuration below. HA was not disabled while these settings were applied (just FYI).

Code:

totem {
  cluster_name: proxmox-prod
  config_version: 80
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  token_coefficient: 200
  version: 2
}

Our findings from testing: we deleted a node from the cluster at around Jan 15 11:48:22 (see the attached logs). The cluster did not lose quorum. Thank you!

Sounds good, thanks for reporting back!

aomer786 said:
Based on this, do you still recommend creating a separate network for corosync? The only option we see is to create another bond for corosync with LACP (fast) everywhere on the existing switches, which are in a VLT setup.

I'd generally recommend a dedicated primary network for corosync, to prevent other traffic from driving up corosync latencies, see [1]. Since corosync can handle multiple redundant networks itself [2], creating a bond for corosync is usually not necessary, though it can be an option to configure a redundant corosync network over a pre-existing bond shared with other traffic types. See [3] for some additional considerations that are relevant when running corosync over a bond.

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_requirements
[2] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
[3] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_corosync_over_bonds

spirit · Feb 13, 2026

fweber said:
Sounds good, thanks for reporting back!

I'd generally recommend a dedicated primary network for corosync, to prevent other traffic from driving up corosync latencies, see [1]. Since corosync can handle multiple redundant networks itself [2], creating a bond for corosync is usually not necessary, though it can be an option to configure a redundant corosync network over a pre-existing bond shared with other traffic types. See [3] for some additional considerations that are relevant when running corosync over a bond.

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_requirements
[2] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_redundancy
[3] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_corosync_over_bonds

From the release notes of the latest corosync release:
https://github.com/corosync/corosync/releases

"A new option (totem.ip_dscp) is available to configure DSCP for traffic
prioritization. Thanks to David Hanisch for this great improvement."

This could be interesting for stretched clusters across datacenters where links can be 100% dedicated to corosync, or for clusters without dedicated interfaces.

(Of course, your switches need to support it.)
