Hey there
I'm experiencing a very weird issue with my Proxmox cluster (version 7.2-7). Everything works just fine with 2 nodes, but as soon as the third one comes up, corosync stops working. This made it very hard to get the third node to join the cluster in the first place; the join only completed because I happened to restart corosync on another node while it was in progress.
When *any* 2 nodes are up, the corosync logs are clean: no "Retransmit List" messages or the like. As soon as I start the third one, I can no longer write anything to /etc/pve and the logs fill up with "Retransmit List" entries. Usually nodes 1 and 3 are flooded while node 2 is not, so I'm assuming node 2 is the one not responding. This happens regardless of whether the node that was initially stopped was 1, 2 or 3.
I'm also seeing messages like "pvescheduler[555758]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout", in case it's related.
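If more output would help, I can capture the cluster and membership state from each node while the issue is happening, e.g. with the standard corosync/PVE tooling (nothing custom):
Code:
pvecm status
corosync-quorumtool -s
corosync-cfgtool -s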
NB: the nodes are bare-metal servers at OVH, hosted in 3 different datacenters.
Things I tried without luck:
- Giving 10 votes to node1, but that didn't change anything (see the snippet after this list for roughly what that change looked like)
- Using zerotier-one IPs for corosync (thinking maybe something was being blocked on OVH's side), with the same result
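For reference, the "10 votes" attempt was just a quorum_votes change on node1's entry in corosync.conf, roughly like this (simplified, placeholder address; the actual file is in the gist right below):
Code:
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 10
    ring0_addr: <node1-ip>
  }
  # node2 and node3 entries unchanged, 1 vote each
}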
My corosync.conf: https://gist.github.com/tubededentifrice/36442fc07a34583f05f91cbdeb45f7ab
Logs when corosync is stopped on node3 and restarted on node1 and node2: https://gist.github.com/tubededentifrice/c510a6aae46b4a70da87fe2f815677f0
And when I start the third one: https://gist.github.com/tubededentifrice/7a942688da7fb3a5faf0980b1e6ce362
Any idea what could be going wrong?
Thanks
Edit:
NB:
- Firewall is fully open between nodes, but blocked from the outside.
- I'm using an authenticating proxy in front of my servers, so trying to reach node[123].mydomain.com via the public DNS IP won't get you anywhere, and port 8006 is blocked there. The nodes, however, resolve each other via /etc/hosts entries (domain -> IP, all set up correctly), so node-to-node access works fine, just like using the direct IPs. I see no reason why this would be related, but mentioning it just in case.
- And yes, I triple-checked that /etc/hosts is correct; most notably, "pvelocalhost" is set to the server's correct public IP (should it point to the loopback interface instead?). A simplified example of what the file looks like is below.
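For illustration, each node's /etc/hosts looks roughly like this (placeholder IPs here; the real ones are the servers' public addresses, and "pvelocalhost" sits on whichever node the file lives on, node1 in this example):
Code:
127.0.0.1   localhost.localdomain localhost
203.0.113.1 node1.mydomain.com node1 pvelocalhost
203.0.113.2 node2.mydomain.com node2
203.0.113.3 node3.mydomain.com node3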