Adding a server to cluster causes cluster to crash

richinbg

Member
Oct 2, 2017
Hi,
I successfully added a new server to my Proxmox cluster, and after a few minutes all the machines suddenly started rebooting one after another...

The cluster now consists of four servers. For about 2-5 minutes they all showed that everything was OK and that corosync was up.

But then machine 02 suddenly rebooted itself, quorum was lost, and in the end all the machines rebooted themselves except the new one.

This seems odd to me, especially since I expected a cluster of four machines to be more stable, not less, than one with just three servers.
Also, the wiki does not state "be aware that if you add a server to the cluster, your complete cluster might fail over"...
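
For reference, here is a minimal sketch of the quorum arithmetic, assuming the default setup of one vote per node and a strict-majority rule (the function names are made up for illustration, they are not part of any Proxmox or corosync API). It suggests that going from three to four nodes does not by itself let the cluster survive more failed nodes:

```python
def quorum(total_votes: int) -> int:
    """Votes needed for a strict majority (assumes one vote per node)."""
    return total_votes // 2 + 1


def tolerated_failures(total_votes: int) -> int:
    """How many nodes can drop out before the remainder loses quorum."""
    return total_votes - quorum(total_votes)


for n in (3, 4):
    print(f"{n} nodes: quorum={quorum(n)}, can lose {tolerated_failures(n)}")
```

Under these assumptions both a three-node and a four-node cluster can lose only one node and stay quorate.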

I assume this is not expected standard behavior either?

Any tips would be welcome so that I can understand this and prevent it from happening again...
 
We have the same issue with 15 nodes.
We were adding the 16th node and suddenly all 15 nodes rebooted themselves. Since they were also running Ceph, we had massive data corruption.
Imagine the disaster.

I still don't understand how HA works. It is supposed to protect the cluster from failing by fencing a node that has an issue, but it is rebooting all our nodes for no apparent reason.
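
To make the fencing question concrete, here is a deliberately simplified model, not Proxmox code. It assumes that a node with active HA services resets itself once it is no longer part of a quorate partition, and that every node has one vote. The node names and the split are hypothetical; this is not a reconstruction of what actually happened in our cluster.

```python
def nodes_that_self_fence(partitions, total_votes, ha_nodes):
    """Return HA nodes sitting in a partition without a strict majority.

    Simplified model: one vote per node, and an HA node resets itself
    when its partition holds fewer than total_votes // 2 + 1 votes.
    """
    needed = total_votes // 2 + 1
    fenced = set()
    for part in partitions:
        if len(part) < needed:
            fenced |= set(part) & set(ha_nodes)
    return sorted(fenced)


# Hypothetical 16-node cluster; the new node n16 runs no HA services yet.
nodes = [f"n{i:02d}" for i in range(1, 17)]
ha_nodes = nodes[:15]

# Hypothetical membership split into two halves of 8 votes each:
# neither half reaches the 9 votes needed, so no partition is quorate.
split = [nodes[:8], nodes[8:]]
fenced = nodes_that_self_fence(split, len(nodes), ha_nodes)
print(len(fenced), "nodes would reset:", fenced)
```

In this model, fencing only isolates a single bad node when the rest of the cluster still holds a majority. If membership falls apart so that no group has a majority, every HA node is on the losing side at once.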

We still haven't found anything on the internet or in the logs.
We have posted this in the forum, but got no answer from the Proxmox guys.
We even asked for a paid consultancy but got no response.
 
Well, I have now switched to a Windows Hyper-V cluster with S2D. It works, or at least worked, fabulously until recently. I now have some disk performance issues I didn't have before, but that could be shitty SSDs, or simply drives that are too slow, which caused issues with Ceph as well...
Hope I can fix that - if not I will trade in the servers lol.
Anyway, I also never could figure out the problem you were having...
Are you using static or at least properly configured multicast?
 

I'm still trying to figure out the issue even though the business is at risk. I don't want to give up on Proxmox so easily, but I might have to.
Regarding multicast, it is said that Proxmox 6 and corosync 3 don't use multicast anymore, so that possibility is out.
I've been told that corosync is sensitive to latency, so we moved its network onto two dedicated switches and made sure that the latency is always less than 1 ms. Yet the issue happened again.
Apparently something is behaving abnormally in the cluster and we haven't found it yet.
The worst thing is that we don't get any help.
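
On the latency point above, one thing that may be worth checking is whether the "under 1 ms" figure is an average or holds for every sample, since a short spike can hide behind a good average. A minimal sketch, assuming you already have round-trip times in milliseconds from some measurement (the sample values and the 1 ms threshold below are made up for illustration):

```python
import statistics


def summarize_rtt(samples_ms, threshold_ms=1.0):
    """Summarize round-trip times and count samples above a threshold."""
    over = [s for s in samples_ms if s > threshold_ms]
    return {
        "mean": statistics.mean(samples_ms),
        "max": max(samples_ms),
        "over_threshold": len(over),
    }


# Hypothetical data: 99 fast samples and a single 50 ms spike.
samples = [0.2] * 99 + [50.0]
summary = summarize_rtt(samples)
print(summary)
```

With this made-up data the mean stays below 1 ms even though one sample is far above it, so looking at the maximum and the number of outliers tells more than the average alone.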
 
Well, you should just try again. They might just be super busy, or have missed it, or not know the answer themselves.

Yeah, I am now running everything in a dedicated subnet, including a VLAN; however, I don't have just one switch (currently).
I hope you'll find a solution.

Do you have the connecting/edge ports secured to prevent storms of any kind?
 
We even asked for a paid consultancy but no response.

This is just not true; all your requests were answered. Telling nonsense here in the thread will not help to fix your issues and will just damage our reputation.

You did not get a support contract in time, so when the emergency came you had to purchase one first, which only cost you time.

Lessons to learn: debug and fix your issues by yourself or with the community, or make sure you purchase your support contract BEFORE you are in an emergency situation. Our enterprise team can review your HA cluster setup before you go into production. If you follow our recommendations, cluster issues are really rare.
 
