Understanding "Cluster not ready - no quorum? (500)" during VM start

Feb 17, 2020
I'm administering a Proxmox setup that has two machines in cluster:
1) a mission-critical server - which runs on a long-lasting UPS (should run without mains power for hours)
2) a non-critical server - which runs on a small UPS (lasts a couple of minutes, just enough to shut the server down properly)

The problem is that when there is a power outage and I reboot server 1), I cannot start any VM because of the "cluster not ready - no quorum? (500)" error message. The VMs I want to start have nothing to do with server 2) at all: no shared resources, no shared storage. In my opinion I should be able to start these VMs.

Of course, a good question: why am I running server 1) and server 2) in a cluster at all? Well, I do that because I want to be able to migrate VMs between 1) and 2) as needed.

IMO when a VM is not running in HA mode, it should start fine even if another (non-related) node of the cluster is down. Is my understanding wrong? Is nobody else having this problem? Thanks!
 
IMO when a VM is not running in HA mode, it should start fine even if another (non-related) node of the cluster is down. Is my understanding wrong? Is nobody else having this problem? Thanks!

The problem is that PVE does not know (100%) that the other server is offline, and to avoid split-brain it refuses to modify or start any VMs.
The good news is that you can simply set "expected votes" to one in that case, and then you can work again:

# pvecm expected 1

Note: please only do that if you are 100% sure the other node is offline.
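
If you want to double-check the current quorum state before and after, you can look at the output of:

# pvecm status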
 
Hi @dietmar - am I right that I can legitimately run `pvecm expected 1` and then set it back with `pvecm expected 2` before I power the second server back on? Is that a correct assumption?

Do you think that I should discard this cluster and go ahead with two separate PVE machines, given that it's essential to my client that one machine stays 100% functional if the other node is offline (there are numerous possible reasons: hardware failure, software failure, power failure)?

My assumption is: the point of running VMs is that the failure of one system (VM) does not affect the other systems (VMs). On the other hand, if the failure of a single cluster node requires non-trivial manual intervention, then I think clustering is a good idea only if someone requires HA / heavy load balancing. Is my assumption correct?

Thanks!
 
Hi @dietmar - am I right that I can legitimately run `pvecm expected 1` and then set it back with `pvecm expected 2` before I power the second server back on? Is that a correct assumption?

There is no need to set expected votes to '2' - this is done automatically when the second node connects.

Do you think that I should discard this cluster and go ahead with two separate PVE machines, given that it's essential to my client that one machine stays 100% functional if the other node is offline (there are numerous possible reasons: hardware failure, software failure, power failure)?

My assumption is: the point of running VMs is that the failure of one system (VM) does not affect the other systems (VMs). On the other hand, if the failure of a single cluster node requires non-trivial manual intervention, then I think clustering is a good idea only if someone requires HA / heavy load balancing. Is my assumption correct?

Yes (for 2-node clusters, and only if you really think the command "pvecm expected 1" is non-trivial manual intervention).
 
Hi @dietmar, thanks for your answer.

There is no need to set expected votes to '2' - this is done automatically when the second node connects.

Just to clarify: in your first comment you wrote that I should run the `pvecm expected 1` command only if I'm absolutely sure that the second host is not running. Do I understand correctly that even if I did run the `pvecm expected 1` command while the second host was running, it wouldn't be a disaster, because I could just restart the second node and it would re-register itself? I.e. a restart solves this problem?

Yes (for 2-node clusters, and only if you really think the command "pvecm expected 1" is non-trivial manual intervention).

Well, yes, I think it's non-trivial in the sense that it's not trivial to automate :( First we need to count the nodes that are active, we need to make sure that the other nodes are really down, and then we need to execute the `pvecm expected N` command based on that information, and everything needs to run as a startup script so that no manual intervention is necessary. Additionally, timing is an issue, because we should give the other node enough time to boot before we declare it dead and decrease the 'expected' count. That is what I meant by non-trivial :)
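
A minimal sketch of that startup logic, just to show the moving parts (the peer address and grace period below are placeholders, and plain ping is assumed as the "is it really down?" check; this is an illustration, not a recommended way to run a cluster):

#!/bin/bash
# Illustration only: lower the expected vote count if the peer never comes up.
PEER="192.0.2.10"    # placeholder address of the other cluster node
BOOT_GRACE=300       # hypothetical number of seconds to let the peer boot

sleep "$BOOT_GRACE"

# If the peer still does not answer pings, assume it is down and let this
# node regain quorum on its own.
if ! ping -c 3 -W 2 "$PEER" >/dev/null 2>&1; then
    pvecm expected 1
fi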
 
Do I understand correctly that even if I did run the `pvecm expected 1` command while the second host was running,

As stated above, you should never do that, because it can lead to split brain.

Well, yes, I think it's non-trivial in the sense that it's not trivial to automate :(

Yes, that is why most people use at least 3 nodes for a cluster.
 
As stated above, you should never do that, because it can lead to split brain.

But is it still correct to 1) set `pvecm expected 1` and 2) then power on the second node? (As per the suggestion in your 2nd response.)

Yes, that is why most people use at least 3 nodes for a cluster.

Does it mean that if I have a cluster of 3 or more nodes, then one node failing does not prevent the other nodes from starting VMs, as long as there are at least 2 nodes running?
 
But is it still correct to 1) set `pvecm expected 1` and 2) then power on the second node? (As per the suggestion in your 2nd response.)

Sorry, but I do not understand that question. It makes no sense to set expected votes manually if you power on the second node anyway.

Does it mean that if I have a cluster of 3 or more nodes, then one node failing does not prevent the other nodes from starting VMs, as long as there are at least 2 nodes running?

You need "quorum", that means a majority of nodes must be online (more than half of your nodes).
 
Sorry, but I do not understand that question. It makes no sense to set expected votes manually if you power on the second node anyway.



You need "quorum", that means a majority of nodes must be online (more than half of your nodes).
Jumping in a little late to the thread, but why not just stop the quorum service until he needs it?
 
I had this error too, and this solution helped me a lot:
Copy these files from a node that is working fine in your cluster to the node that has the issue (the "proxmox no quorum 500" one):
scp -r /etc/corosync/* root@xx.xx.xx.xx:/etc/corosync/
scp /etc/pve/corosync.conf root@xx.xx.xx.xx:/etc/pve/
systemctl restart pve-cluster
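
As far as I understand those commands: the first scp copies the corosync configuration and authkey from the healthy node, the second copies the cluster-wide corosync.conf under /etc/pve/, and restarting pve-cluster makes the broken node pick the configuration up again.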
 
