Hey there
I'm experiencing a very weird issue with my Proxmox cluster (version 7.2-7). Everything works just fine with 2 nodes, but as soon as the third one comes up, corosync stops working. This made it very hard to get the third node to join the cluster in the first place; the join only completed because I happened to restart corosync on another node while it was in progress.
When *any* 2 nodes are up, the corosync logs are clean: no "Retransmit List" messages or the like. As soon as I start the third one, I can no longer write anything to /etc/pve and the logs fill up with "Retransmit List" entries. Usually nodes 1 and 3 are flooded while node 2 is not, so I'm assuming node 2 is the one not responding. This happens regardless of whether the node that was initially stopped was 1, 2 or 3.
I'm also seeing messages like "pvescheduler[555758]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout", in case it's related.
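If more output would help, I can capture the cluster and membership state from each node while the issue is happening, e.g. with the standard corosync/PVE tooling (nothing custom):
Code:
pvecm status
corosync-quorumtool -s
corosync-cfgtool -s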
NB: the nodes are bare-metal servers at OVH, hosted in 3 different datacenters.
Things I tried without luck:
- Giving 10 votes to node1, but that didn't change anything (see the snippet after this list for roughly what that change looked like)
- Using zerotier-one IPs for corosync (thinking maybe something was being blocked on OVH's side), with the same result
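For reference, the "10 votes" attempt was just a quorum_votes change on node1's entry in corosync.conf, roughly like this (simplified, placeholder address; the actual file is in the gist right below):
Code:
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 10
    ring0_addr: <node1-ip>
  }
  # node2 and node3 entries unchanged, 1 vote each
}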
My corosync.conf: https://gist.github.com/tubededentifrice/36442fc07a34583f05f91cbdeb45f7ab
Logs when corosync is stopped on node3 and restarted on node1 and node2: https://gist.github.com/tubededentifrice/c510a6aae46b4a70da87fe2f815677f0
And when I start the third one: https://gist.github.com/tubededentifrice/7a942688da7fb3a5faf0980b1e6ce362
Any idea what could be going wrong?
Thanks
Edit:
NB:
- Firewall is fully open between nodes, but blocked from the outside.
- I'm using an authenticating proxy in front of my servers, so trying to reach node[123].mydomain.com via the public DNS IP won't get you anywhere, and port 8006 is blocked there. The nodes, however, resolve each other via /etc/hosts entries (domain -> IP, all set up correctly), so node-to-node access works fine, just like using the direct IPs. I see no reason why this would be related, but mentioning it just in case.
- And yes, I triple-checked that /etc/hosts is correct; most notably, "pvelocalhost" is set to the server's correct public IP (should it point to the loopback interface instead?). A simplified example of what the file looks like is below.
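For illustration, each node's /etc/hosts looks roughly like this (placeholder IPs here; the real ones are the servers' public addresses, and "pvelocalhost" sits on whichever node the file lives on, node1 in this example):
Code:
127.0.0.1   localhost.localdomain localhost
203.0.113.1 node1.mydomain.com node1 pvelocalhost
203.0.113.2 node2.mydomain.com node2
203.0.113.3 node3.mydomain.com node3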