40-node production cluster restarts when adding or removing a node.

aomer786

New Member
Aug 6, 2025
Need help finding and fixing the root cause. Below are some details of my findings so far. The cluster is set up with one bond per node and everything goes through it. There are different VLANs, but it all essentially travels over the same physical bond. Internal Ceph is running on 6 nodes in the cluster. Corosync also communicates over this single link. All servers in the cluster have at least 4 physical NICs. I keep coming across the recommendation to run corosync on its own physical NIC. Could this be it -- does running it over a single bond per server cause the entire cluster to restart? I have attached a screenshot of how the network on each node is set up on the Proxmox side.

Here are the current totem settings. Can changing any of these values help, or should they not be tinkered with in production?

Code:
runtime.config.totem.block_unlisted_ips (u32) = 1
runtime.config.totem.cancel_token_hold_on_retransmit (u32) = 0
runtime.config.totem.consensus (u32) = 32460
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 5142
runtime.config.totem.interface.0.knet_ping_interval (u32) = 6762
runtime.config.totem.interface.0.knet_ping_timeout (u32) = 13525
runtime.config.totem.join (u32) = 50
runtime.config.totem.knet_compression_level (i32) = 0
runtime.config.totem.knet_compression_model (str) = none
runtime.config.totem.knet_compression_threshold (u32) = 0
runtime.config.totem.knet_mtu (u32) = 0
runtime.config.totem.knet_pmtud_interval (u32) = 30
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 27050
runtime.config.totem.token_retransmit (u32) = 6440
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.token_warning (u32) = 75
runtime.config.totem.window_size (u32) = 50
totem.cluster_name (str) = proxmox-prod
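
As a side note on those runtime values: corosync scales several totem timeouts with cluster size, so the large `token` value is expected rather than hand-tuned. A rough sanity check (a sketch, assuming the stock `token: 3000` ms and `token_coefficient: 650` ms defaults, which is an assumption on my part):

```shell
# Sketch, not authoritative: corosync derives the runtime token timeout as
#   runtime_token = token + (nodes - 2) * token_coefficient
# Assuming the defaults token=3000 ms and token_coefficient=650 ms,
# the observed 27050 ms corresponds to a 39-node membership:
token=3000        # configured totem.token (assumed default, in ms)
coeff=650         # totem.token_coefficient (assumed default, in ms)
nodes=39
runtime_token=$(( token + (nodes - 2) * coeff ))
echo "runtime token: ${runtime_token} ms"    # 27050, matching the dump above

# consensus defaults to 1.2 * token, which also matches the dump:
consensus=$(( runtime_token * 12 / 10 ))
echo "consensus: ${consensus} ms"            # 32460
```

If those assumed defaults hold, the timeouts are already being stretched automatically as nodes are added, which is one reason membership changes get riskier at this scale.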

There are a bunch of log lines like this in the corosync syslog on a few nodes; below is one example.

Code:
Dec 10 22:48:50 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 19e6f7
Dec 10 23:10:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: host: 9 link: 0 is down
Dec 10 23:10:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Dec 10 23:10:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 9 has no active links
Dec 10 23:10:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: Resetting MTU for link 0 because host 9 joined
Dec 10 23:10:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 9 (passive) best link: 0 (pri: 1)
Dec 10 23:10:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 10 23:13:30 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 1baf4c
Dec 10 23:21:43 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 1c4b06
Dec 11 00:06:17 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 1f7228
Dec 11 01:18:02 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 248b62
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: host: 14 link: 0 is down
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 14 has no active links
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: Resetting MTU for link 0 because host 14 joined
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 14 (passive) best link: 0 (pri: 1)
Dec 11 01:18:19 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 01:30:22 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 25726e
Dec 11 01:40:14 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 262273
Dec 11 01:47:42 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 26affb
Dec 11 02:35:15 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 2a172c
Dec 11 02:38:00 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 2a4923
Dec 11 03:30:23 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 2e1ca2
Dec 11 04:01:36 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 305aca
Dec 11 04:15:36 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 315c94
Dec 11 04:15:55 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 31627f
Dec 11 04:19:30 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 31a598
Dec 11 04:25:55 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3218d2
Dec 11 04:39:14 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3311fe
Dec 11 04:45:17 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 33808d
Dec 11 04:59:10 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3480bb
Dec 11 05:21:26 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 361b63
Dec 11 05:44:05 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 37bc3e
Dec 11 05:47:00 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 37f023
Dec 11 06:02:57 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 391264
Dec 11 06:13:23 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 39d830
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a3368
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a338e
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33e1
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33ef
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f2
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f4
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f6
Dec 11 06:18:33 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a33f7
Dec 11 06:18:34 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a3425
Dec 11 06:18:34 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a3426
Dec 11 06:18:34 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a342e
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a34e4
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a34e6
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a353d
Dec 11 06:18:38 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3a353e
Dec 11 07:33:53 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 3fa465
Dec 11 07:38:09 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: host: 22 link: 0 is down
Dec 11 07:38:09 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 22 (passive) best link: 0 (pri: 1)
Dec 11 07:38:09 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 22 has no active links
Dec 11 07:38:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] link: Resetting MTU for link 0 because host 22 joined
Dec 11 07:38:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] host: host: 22 (passive) best link: 0 (pri: 1)
Dec 11 07:38:11 proxmox-prod03.sgdctroy.net corosync[3941]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Dec 11 07:43:06 proxmox-prod03.sgdctroy.net corosync[3941]:   [TOTEM ] Retransmit List: 404676
 

Attachments

  • Screenshot 2025-12-11 at 10.14.26 AM.png
Thank you for the reply. I understand the recommendation now. Just based on the logs provided, is this in fact a corosync issue from being on the same physical network as everything else? Or are there any tuning settings we can apply to help us through this while we figure out how to set up corosync redundantly?
 
Hi, from a quick look, the "Retransmit" messages may be a symptom of network stability issues (e.g. lost packets, increased latency, etc.) that are more likely to occur if corosync shares a physical network with other traffic types -- I'd expect that running corosync on a dedicated physical network should bring an improvement there [1]. Apart from that, with 40 nodes some corosync fine-tuning might be necessary; see [2] for more information. An alternative would be to split the 40-node cluster into two smaller 20-node clusters and use the Proxmox Datacenter Manager to manage them.

[1] https://pve.proxmox.com/pve-docs/chapter-pvecm.html#pvecm_cluster_requirements
[2] https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/post-812204
 
What information regarding tuning corosync for a 40-node cluster are you referring to in the linked post?
Right now, only the fact that we're working on providing guidance on the necessary corosync tweaks for bigger clusters. Once that is available, we can update the linked post (to avoid scattering information across multiple posts).
 
You can just add it on other NICs; the recommendation is to use several: https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy ("Adding Redundant Links To An Existing Cluster")
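
For illustration, adding a second link means giving every node an extra `ringX_addr` in `/etc/pve/corosync.conf` (the node name, addresses, and the `10.99.0.0/24` subnet below are made up; the real procedure, including bumping `config_version`, is in the linked docs):

```
nodelist {
  node {
    name: proxmox-prod03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.0.2.13    # existing link over the bond (example address)
    ring1_addr: 10.99.0.13    # new dedicated corosync NIC (example address)
  }
  # ... repeat for every node ...
}

totem {
  interface {
    linknumber: 0
  }
  interface {
    linknumber: 1
  }
}
```

With knet, links can also be prioritized (e.g. via `knet_link_priority` on the interface) so the dedicated NIC becomes the preferred path and the bond stays as the fallback.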
The team is currently busy recovering from the two back-to-back outages we had this week. There are plans to review the docs next week. I suppose what I am looking for at the moment is concrete evidence that the outages were caused by corosync; I'm not sure if there are any other logs I can provide to pin down the root cause. We're also waiting to see what the corosync tuning recommendations are for a large cluster -- any estimated timeline on this?

Some more additional info about the network setup from our Network Engineer. "Each server has a 20Gb bond, with relatively low utilization. Our switches are all interconnected with 200Gb bonds, each switch can handle 1.2Tb of traffic, we're barely using 1% of that."

Bonds are set up on the servers like so:
Code:
auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        bond-lacp-rate 1
        bond-downdelay 200
        bond-updelay 200
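
Since each server reportedly has at least 4 physical NICs, a sketch of a dedicated corosync interface alongside the bond might look like this in `/etc/network/interfaces` (the NIC name `eno3` and the `10.99.0.0/24` subnet are placeholders for whatever is free in your environment):

```
# existing bond0 stays as-is for VM/Ceph/management traffic

auto eno3
iface eno3 inet static
        address 10.99.0.13/24
        # dedicated, unbonded corosync link: no VLANs, no bridge,
        # ideally on its own switch or at least its own switch ports
```

The point of keeping it unbonded and unbridged is to minimize latency variance, which matters to corosync far more than bandwidth.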


There are other concerns as well related to "pmtud: Global data MTU changed to: 1397". As far as I can tell, on the server and switch side everything is set up to use "mtu 1500".
 
There are other concerns as well related to "pmtud: Global data MTU changed to: 1397". As far as I can tell, on the server and switch side everything is set up to use "mtu 1500".
AFAIK that's totally normal, as this is the usable message size: 1500 minus the knet/UDP/IP overhead.
 
@aomer786 There have been various posts here about maximum cluster size, like the one linked above. The concern with corosync is latency, not bandwidth: if enough milliseconds go by, a node is considered unresponsive. A NIC can't send two packets at the same time, regardless of overall bandwidth usage. Hence the recommendation to use a dedicated 1 Gbps NIC for the primary corosync link.

Nodes that lose connection to the cluster will reboot, and if more than half are offline, the remainder will reboot as well because there is no quorum. Redundant links may help in the sense that corosync can check connectivity over several NICs.
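
To make the quorum point concrete (a sketch; corosync with default settings requires a strict majority of votes):

```shell
# With one vote per node, quorum is floor(nodes / 2) + 1.
# If more than half the nodes drop out of the membership, the survivors
# lose quorum and self-fence (reboot) as well.
nodes=40
quorum=$(( nodes / 2 + 1 ))
echo "quorum: ${quorum} votes"    # 21

surviving=19
[ "$surviving" -ge "$quorum" ] && echo "quorate" || echo "not quorate"   # not quorate
```

So in a 40-node cluster, a membership flap that briefly isolates 20 or more nodes can take down the other side too.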

It seems like there are a few issues with the cluster design, and if you're at the practical maximum (already seeing problems), adding more nodes will make it worse. In fact, it may help to shut down or remove a few nodes, since that means less corosync traffic. And/or run two clusters, as mentioned.

> concrete evidence that the outages was caused by corosync

I haven't run into that myself, so I don't know the exact log entries to look for. However, since nodes are supposed to reboot when they lose connection, that is the most likely cause of spontaneous node reboots, at least among those posted here.
 
@SteveITS Thank you! Currently, even if we try to remove a node, the entire cluster reboots. Perhaps we need to set up the new link before we can proceed. I wonder if adding a new link to a running cluster can itself cause the cluster to restart?

One of the nodes was stuck in an endless reboot cycle and is currently turned off.
 
Based on my experience with a small cluster when we were getting started, we had no issues doing it on a live system while following the directions. We added another link, then as a second step changed the order, as I recall, so the new one became ring0.

I don't know what the timeout threshold is, but hypothetically "n-1 nanoseconds" could be fine while "n+1 nanoseconds" causes nodes to time out. They're basically just constantly poking each other. I'd guess removing a node takes just long enough to trip the sensors. And I suppose that means more corosync links are not necessarily guaranteed to solve this, but it's relatively easy to try. Another option would be to remove nodes until the problem goes away.
 
Hi, from a quick look, the "Retransmit" messages may be a symptom of network stability issues (e.g. lost packets, increased latency, etc.) that are more likely to occur if corosync shares a physical network with other traffic types -- I'd expect that running corosync on a dedicated physical network should bring an improvement there. Apart from that, with 40 nodes some corosync fine-tuning might be necessary; see [2] for more information. [...]
Do you have an ETA already?
Hi, I posted a status update here: https://forum.proxmox.com/threads/proxmox-with-48-nodes.174684/post-825826