Proper procedure for restarting entire cluster

cwells

Member
Jun 8, 2009
33
0
6
I set up an 8-node cluster (PVE 2.x) in the lab. At some point I needed to reboot all the machines, so I did (I was logged into all of them over SSH and just issued the reboot command).

After they all came back up, I was hit with a loss of quorum. Each time I reboot a node, it reports that the cluster isn't ready. I can run pvecm expected 1 to restore quorum manually, but this appears to be only a temporary fix: any time a node is restarted, it comes up without quorum. As a result, I ended up deleting and rebuilding the cluster.

Is there a proper procedure for restarting the entire cluster that will avoid this situation? Once I've finished my configuration and testing, this cluster will need to be moved into colocation, and I'd hate to have to recreate the cluster after doing so. I also have to consider the possibility that unforeseen events might cause the cluster to be shut down at some point in the future (e.g. total power loss in the cabinet). Is there any way to have the cluster recover after such an event?
 

tom

Proxmox Staff Member
Staff member
Aug 29, 2006
15,525
911
163
A re-creation is never needed. You can always get online again.

As a basic rule, you need to make sure that you never lose quorum. E.g. if you need to reboot, reboot just one node, and once that node is online again, reboot the next one. Go step by step so that you always have quorum.
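The rule above can be checked with a little arithmetic. The following is only a minimal sketch, assuming every node carries exactly one vote and that quorum means strictly more than half of all votes; the function names are made up for illustration and are not part of any Proxmox tool.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest vote count that is strictly more than half of total_votes."""
    return total_votes // 2 + 1


def reboot_keeps_quorum(total_nodes: int, nodes_down_at_once: int = 1) -> bool:
    """True if the remaining nodes still hold quorum while some are rebooting.

    Assumption: one vote per node, no quorum disk.
    """
    still_online = total_nodes - nodes_down_at_once
    return still_online >= quorum_threshold(total_nodes)


# Rebooting an 8-node cluster one node at a time leaves 7 votes online.
print(reboot_keeps_quorum(8, 1))
# Taking 4 of 8 nodes down at once leaves only 4 votes online.
print(reboot_keeps_quorum(8, 4))
```

Under these assumptions, an 8-node cluster tolerates up to 3 nodes being down at the same time; the one-at-a-time approach simply stays well inside that margin.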
 

cwells

If I need to move the machines to a different facility or power is lost, rebooting them one at a time isn't an option. Nevertheless, I think I've found my answer.

Actually, I'm realizing the problem (and I provided bad information in my post). I didn't actually bring up all 8 nodes; I only brought up 4, because I'm on a 20A circuit and also needed to bring up a pair of storage servers and their associated arrays, which pushed me close to the 20A limit. I was under the impression that I only needed 3 nodes for quorum, but I've since read elsewhere that I need >50% of the nodes at all times in order to maintain quorum:

http://www.karlkatzke.com/stonithfencing-why-you-need-it/

A quorum is defined as > 50% of the machines in the cluster. Not >= 50%, but > 50%. Two machines won’t do it in a four-node cluster.


So apparently bringing up 4 nodes out of an 8-node cluster was a bad idea. I will retest. I'm assuming that as soon as 5 nodes are up, quorum will magically appear =)
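To make the ">50%, not >=50%" rule concrete, here is a small sketch that applies it to the 8-node case. It assumes one vote per node; has_quorum is a hypothetical helper written for this post, not a pvecm or corosync call.

```python
def has_quorum(online_votes: int, expected_votes: int) -> bool:
    """Quorum requires strictly more than half of the expected votes."""
    # Compare doubled values to stay in integer arithmetic.
    return 2 * online_votes > expected_votes


# Walk through the node counts discussed above for an 8-node cluster.
for online in (3, 4, 5):
    state = "quorate" if has_quorum(online, 8) else "no quorum"
    print(f"{online} of 8 nodes online: {state}")
```

With 4 of 8 nodes the cluster sits at exactly 50%, which is not enough; the fifth node is what tips it over.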
 

mir

Famous Member
Apr 14, 2012
3,559
120
83
Copenhagen, Denmark
If you are faced with power limits, you could create a small quorum disk on your storage and add it to the cluster. The cluster then behaves like a 9-node cluster, where quorum can be reached by 5 nodes, or by 4 nodes plus the quorum disk. Provided the quorum disk resides on the storage, which must be brought up early anyway, you will have quorum with only 4 nodes up, and can therefore power up the entire cluster within your power limits.
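The vote counting behind this suggestion can be sketched as follows. This is only an illustration, assuming each node has one vote and the quorum disk contributes one extra vote (which is what makes it count like a ninth node); min_nodes_for_quorum is a hypothetical name, not a real cluster command.

```python
def min_nodes_for_quorum(node_count: int, qdisk_votes: int, qdisk_online: bool) -> int:
    """Fewest nodes that must be up for the cluster to be quorate.

    Assumptions: one vote per node, and the quorum disk adds qdisk_votes
    to the total whether or not it is currently reachable.
    """
    total_votes = node_count + qdisk_votes
    needed = total_votes // 2 + 1  # strictly more than half of all votes
    if qdisk_online:
        needed -= qdisk_votes  # the disk supplies part of the required votes
    return max(needed, 0)


# 8 nodes plus a one-vote quorum disk gives 9 votes in total.
print(min_nodes_for_quorum(8, 1, qdisk_online=True))
print(min_nodes_for_quorum(8, 1, qdisk_online=False))
```

Under these assumptions the 9 total votes put the threshold at 5, so the storage-backed quorum disk lets 4 nodes reach quorum, while 5 nodes are still enough if the disk is unavailable.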
 

cwells

Interesting that you mention this. For my storage I built a 2-node NFS cluster (corosync/pacemaker, failover mode) on top of a shared fibre channel loop, and I was considering putting the NFS servers into the same cluster as the PVE nodes, since that would also provide quorum to the storage cluster. It seems possible, since they use the same cluster software, but because the clusters serve different purposes, I wasn't sure how to proceed or whether this would be a recommended approach. Also, the NFS servers are running Scientific Linux, not PVE, so that would be another hurdle.
 
