40-node production cluster restarts when adding or removing a node.

aomer786

New Member
Aug 6, 2025
Need help finding and fixing the root cause. Below are some details of my findings so far. The cluster is set up with one bond per node and everything goes through it. There are different VLANs, but it all essentially travels over the same physical bond. Internal Ceph is running on 6 nodes in the cluster. Corosync also communicates over this single link. All servers in the cluster have at least 4 physical NICs. I keep coming across the recommendation to run corosync on its own physical NIC. Could this be it -- does running it over a single bond per server cause the entire cluster to restart? I have attached a screenshot of how the network on each node is set up on the Proxmox side.

Here are the current totem settings. Can changing any of these values help, or should they not be tinkered with in production?

Code:
runtime.config.totem.block_unlisted_ips (u32) = 1
runtime.config.totem.cancel_token_hold_on_retransmit (u32) = 0
runtime.config.totem.consensus (u32) = 32460
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 5142
runtime.config.totem.interface.0.knet_ping_interval (u32) = 6762
runtime.config.totem.interface.0.knet_ping_timeout (u32) = 13525
runtime.config.totem.join (u32) = 50
runtime.config.totem.knet_compression_level (i32) = 0
runtime.config.totem.knet_compression_model (str) = none
runtime.config.totem.knet_compression_threshold (u32) = 0
runtime.config.totem.knet_mtu (u32) = 0
runtime.config.totem.knet_pmtud_interval (u32) = 30
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 27050
runtime.config.totem.token_retransmit (u32) = 6440
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.token_warning (u32) = 75
runtime.config.totem.window_size (u32) = 50
totem.cluster_name (str) = proxmox-prod
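
As a side note on those runtime values: corosync scales several totem timeouts with cluster size, so the large `token` value is expected rather than hand-tuned. A rough sanity check (a sketch, assuming the stock `token: 3000` ms and `token_coefficient: 650` ms defaults, which is an assumption on my part):

```shell
# Sketch, not authoritative: corosync derives the runtime token timeout as
#   runtime_token = token + (nodes - 2) * token_coefficient
# Assuming the defaults token=3000 ms and token_coefficient=650 ms,
# the observed 27050 ms corresponds to a 39-node membership:
token=3000        # configured totem.token (assumed default, in ms)
coeff=650         # totem.token_coefficient (assumed default, in ms)
nodes=39
runtime_token=$(( token + (nodes - 2) * coeff ))
echo "runtime token: ${runtime_token} ms"    # 27050, matching the dump above

# consensus defaults to 1.2 * token, which also matches the dump:
consensus=$(( runtime_token * 12 / 10 ))
echo "consensus: ${consensus} ms"            # 32460
```

If those assumed defaults hold, the timeouts are already being stretched automatically as nodes are added, which is one reason membership changes get riskier at this scale.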

There are a bunch of log lines like this in the corosync syslog on a few nodes; below is one example.

Code:
Dec 10 22:48:50 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 19e6f7
Dec 10 23:10:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: host: 9 link: 0 is down
Dec 10 23:10:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Dec 10 23:10:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 9 has no active links
Dec 10 23:10:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: Resetting MTU for link 0 because host 9 joined
Dec 10 23:10:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Dec 10 23:10:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 10 23:13:30 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 1baf4c
Dec 10 23:21:43 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 1c4b06
Dec 11 00:06:17 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 1f7228
Dec 11 01:18:02 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 248b62
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: host: 14 link: 0 is down
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 14 has no active links
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: Resetting MTU for link 0 because host 14 joined
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 01:30:22 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 25726e
Dec 11 01:40:14 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 262273
Dec 11 01:47:42 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 26affb
Dec 11 02:35:15 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 2a172c
Dec 11 02:38:00 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 2a4923
Dec 11 03:30:23 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 2e1ca2
Dec 11 04:01:36 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 305aca
Dec 11 04:15:36 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 315c94
Dec 11 04:15:55 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 31627f
Dec 11 04:19:30 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 31a598
Dec 11 04:25:55 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3218d2
Dec 11 04:39:14 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3311fe
Dec 11 04:45:17 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 33808d
Dec 11 04:59:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3480bb
Dec 11 05:21:26 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 361b63
Dec 11 05:44:05 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 37bc3e
Dec 11 05:47:00 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 37f023
Dec 11 06:02:57 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 391264
Dec 11 06:13:23 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 39d830
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a3368
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a338e
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33e1
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33ef
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f2
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f4
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f6
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f7
Dec 11 06:18:34 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a3425
Dec 11 06:18:34 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a3426
Dec 11 06:18:34 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a342e
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a34e4
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a34e6
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a353d
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a353e
Dec 11 07:33:53 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3fa465
Dec 11 07:38:09 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: host: 22 link: 0 is down
Dec 11 07:38:09 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 22 (passive) best link: 0 (pri: 1)
Dec 11 07:38:09 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 22 has no active links
Dec 11 07:38:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: Resetting MTU for link 0 because host 22 joined
Dec 11 07:38:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 22 (passive) best link: 0 (pri: 1)
Dec 11 07:38:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 07:43:06 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 404676
 

Attachments

  • Screenshot 2025-12-11 at 10.14.26 AM.png
Thank you for the reply. I understand the recommendation now. Just based on the logs provided, is this in fact a corosync issue from being on the same physical network as everything else? Or are there any tuning settings we can apply to help us through this while we figure out how to set up corosync redundantly?
 
Hi, from a quick look, the "Retransmit" messages may be a symptom of network stability issues (e.g. lost packets, increased latency, etc.) that are more likely to occur if corosync shares a physical network with other traffic types -- I'd expect that running corosync on a dedicated physical network should bring an improvement there [1]. Apart from that, with 40 nodes some corosync fine-tuning might be necessary; see [2] for more information. An alternative would be to split the 40-node cluster into two smaller 20-node clusters and use the Proxmox Datacenter Manager to manage them.

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_requirements
[2] https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/post-812204
 
What information regarding tuning corosync for a 40-node cluster are you referring to in the linked post?
Right now, only the fact that we're working on providing guidance on the necessary corosync tweaks for bigger clusters. Once that is available, we can update the linked post (to avoid scattering information across multiple posts).
 
You can just add it on other NICs; the recommendation is to use several: https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy ("Adding Redundant Links To An Existing Cluster")
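
For illustration, adding a second link means giving every node an extra `ringX_addr` in `/etc/pve/corosync.conf` (the node name, addresses, and the `10.99.0.0/24` subnet below are made up; the real procedure, including bumping `config_version`, is in the linked docs):

```
nodelist {
  node {
    name: proxmox-prod03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.0.2.13    # existing link over the bond (example address)
    ring1_addr: 10.99.0.13    # new dedicated corosync NIC (example address)
  }
  # ... repeat for every node ...
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```

With knet, links can also be prioritized (e.g. via `knet_link_priority` on the interface) so the dedicated NIC becomes the preferred path and the bond stays as the fallback.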
The team is currently busy recovering from the two back-to-back outages we had this week. There are plans to review the docs next week. I suppose what I am looking for at the moment is concrete evidence that the outages were caused by corosync; I'm not sure if there are any other logs I can provide to pin down the root cause. We're also waiting to see what the corosync tuning recommendations are for a large cluster -- any estimated timeline on this?

Some more additional info about the network setup from our Network Engineer. "Each server has a 20Gb bond, with relatively low utilization. Our switches are all interconnected with 200Gb bonds, each switch can handle 1.2Tb of traffic, we're barely using 1% of that."

Bonds are set up on the servers like so:
Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        bond-lacp-rate 1
        bond-downdelay 200
        bond-updelay 200
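
Since each server reportedly has at least 4 physical NICs, a sketch of a dedicated corosync interface alongside the bond might look like this in `/etc/network/interfaces` (the NIC name `eno3` and the `10.99.0.0/24` subnet are placeholders for whatever is free in your environment):

```
# existing bond0 stays as-is for VM/Ceph/management traffic

auto eno3
iface eno3 inet static
        address 10.99.0.13/24
        # dedicated, unbonded corosync link: no VLANs, no bridge,
        # ideally on its own switch or at least its own switch ports
```

The point of keeping it unbonded and unbridged is to minimize latency variance, which matters to corosync far more than bandwidth.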


There are other concerns as well related to "pmtud: Global data MTU changed to: 1397". As far as I can tell, on the server and switch side everything is set up to use "mtu 1500".
 
There are other concerns as well related to "pmtud: Global data MTU changed to: 1397". As far as I can tell, on the server and switch side everything is set up to use "mtu 1500".
AFAIK that's totally normal, as this is the usable message size: 1500 minus the knet/UDP/IP overhead.
 
@aomer786 There have been various posts here about maximum cluster size, like the one linked above. The concern with corosync is latency, not bandwidth: if enough milliseconds go by, a node is considered unresponsive. A NIC can't send two packets at the same time, regardless of overall bandwidth usage. Hence the recommendation to use a dedicated 1 Gbps NIC for the primary corosync link.

Nodes that lose connection to the cluster will reboot, and if more than half are offline, the remainder will reboot as well because there is no quorum. Redundant links may help in the sense that corosync can check connectivity over several NICs.
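
To make the quorum point concrete (a sketch; corosync with default settings requires a strict majority of votes):

```shell
# With one vote per node, quorum is floor(nodes / 2) + 1.
# If more than half the nodes drop out of the membership, the survivors
# lose quorum and self-fence (reboot) as well.
nodes=40
quorum=$(( nodes / 2 + 1 ))
echo "quorum: ${quorum} votes"    # 21

surviving=19
[ "$surviving" -ge "$quorum" ] && echo "quorate" || echo "not quorate"   # not quorate
```

So in a 40-node cluster, a membership flap that briefly isolates 20 or more nodes can take down the other side too.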

It seems like there are a few issues with the cluster design, and if you're at the practical maximum (already seeing problems), adding more nodes will make it worse. In fact, it may help to shut down or remove a few nodes, since that means less corosync traffic. And/or run two clusters, as mentioned.

> concrete evidence that the outages was caused by corosync

I haven't run into that myself, so I don't know the exact log entries to look for. However, since nodes are supposed to reboot when they lose connection, that is the most likely cause of spontaneous node reboots, at least among those posted here.
 
@SteveITS Thank you! Currently, even if we try to remove a node, the entire cluster reboots. Perhaps we need to set up the new link before we can proceed. I wonder if adding a new link to a running cluster can itself cause the cluster to restart?

One of the nodes was stuck in an endless reboot cycle and is currently turned off.
 
Based on my experience with a small cluster when we were getting started, we had no issues doing it on a live system while following the directions. We added another link, then as a second step changed the order, as I recall, so the new one became ring0.

I don't know what the timeout threshold is, but hypothetically "n-1 nanoseconds" could be fine while "n+1 nanoseconds" causes nodes to time out. They're basically just constantly poking each other. I'd guess removing a node takes just long enough to trip the sensors. And I suppose that means more corosync links are not necessarily guaranteed to solve this, but it's relatively easy to try. Another option would be to remove nodes until the problem goes away.
 
Hi, from a quick look, the "Retransmit" messages may be a symptom of network stability issues (e.g. lost packets, increased latency, etc.) that are more likely to occur if corosync shares a physical network with other traffic types -- I'd expect that running corosync on a dedicated physical network should bring an improvement there. Apart from that, with 40 nodes some corosync fine-tuning might be necessary; see [2] for more information. [...]
Do you have an ETA already?
Hi, I posted a status update here: https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/post-825826