[SOLVED] Looking for help w/ problems after dying quorum member

dtiKev

Well-Known Member
Apr 7, 2018
Hello again, the fact that I haven't been around in a while is testament to how well this system works :)

I have a 2-node cluster where each node runs a few guests and backs up to the other. The goal is to spread the load and make sure that if one server fails, the other can take up the slack. This has been working well for a good 3 or 4 years now. Well, one of the servers (an old AMD-based Dell R715) is showing signs of failure. I've moved all guests over to the newer Intel-based Dell box for now. My current task is replacing the R715.

I know, I know, best practice would be identical or at least more similar systems. I'm dealing with what I have access to and a limited budget. We don't do live migrations and none of these guests are really mission critical; so long as I can get them running when needed, we are golden. I have a pretty decent Intel machine here that I intend to make the new server. I know I could set it up fresh, join it to the cluster, and start migrations, etc. What I am wondering is whether it's possible to simply move the hard drives over from the AMD-based R715 and just boot it up.

I could just try that, but figured I'd see if anyone knows whether that could succeed in a way that it will just "be" the same member of the cluster on different hardware. The setup is basically three drives: one SSD for speed on some guests, and two others in a ZFS mirror for redundancy. The OS is on the ZFS mirror.

If this is not advisable then I'll go the standard route of course.
 
Okay, so moving the disks over to the new server isn't working. It does get to the boot screen, but then just throws a lot of errors.

So if this isn't possible, what's my best route? Do I set up fresh and add to the cluster? How do I remove a dead system from the cluster? Or do I force boot the old server first and remove it from the cluster?
 
So now I see that I can't even power up machines on the remaining node because it complains that there is no quorum. Is there a way to break the good server out to be a standalone while I build the new server? Otherwise, do I have to try and force the old one back online to break things apart?
 
The cluster needs to see more than half of the systems (to make sure it is in control). You have only half of the systems running and are therefore losing quorum.
You seem to be running into the exact problem mentioned at the end of this Wiki entry, which suggests that your HA setup was not correct (HA requires 3 systems in order to survive 1 failing). Can you disable HA and/or fencing? Maybe someone with more cluster experience can tell you how, because I do not know the details, sorry.
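
To make the "more than half" rule concrete, here is a minimal sketch of the arithmetic in shell. The function names `quorum_needed` and `is_quorate` are made up for illustration and are not part of any Proxmox tool; the sketch also assumes every node carries exactly one vote.

```shell
# Votes needed for quorum: strictly more than half of the expected votes.
# Assumption: each node contributes exactly one vote.
quorum_needed() {
    local expected_votes=$1
    echo $(( expected_votes / 2 + 1 ))
}

# Exit status 0 when the votes currently present reach quorum.
is_quorate() {
    local expected_votes=$1 present_votes=$2
    [ "$present_votes" -ge "$(quorum_needed "$expected_votes")" ]
}

quorum_needed 2   # prints 2: a two-node cluster needs both nodes up
quorum_needed 3   # prints 2: three votes tolerate one failure
```

With two expected votes and one node down, `is_quorate 2 1` fails, which matches the situation described above. A third vote, or an expected count of 1, changes the outcome.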
 
HA begins with 3 hosts on most virtualization platforms (not only Proxmox).
Proxmox, with its replication process, can achieve nearly the same with only 2 hosts.

Nearly all of this is explained in this article (https://pve.proxmox.com/wiki/Cluster_Manager)

But quorum requires more than half of the hosts (50% + 1) to be up to unlock the HA functionality.
Several "hacks" are possible:
- install Proxmox on a small machine (an old desktop PC or a recycled server) that will not run any VMs but can be the third quorum member (or use the qdevice function, only if you're fine with the Linux CLI)
- install a VM with Proxmox inside that will be the third member (not very clever, but it works ...)
- configure the quorum not to respect the 50% ratio (cf. this article on StackOverflow that explains the mechanism)
Regards
 
I only ever had two nodes. The one that's down now was limping along so I migrated all guests to the "good" one.

Now I can't power on any guests that are down because of the lack of quorum. I am going to see if I can get the "bad" server back online long enough to break the quorum.
 
@Pierre-Yves:

I didn't see your post before my last post. But as of now, it's official, pvn1 is dead. So I'm stuck with one node in a two-node cluster. Luckily all guests are on pvn2. So I'm looking for the best bet to get back to where we were, and for the future I will spin up a third "dummy" node so that quorum won't be an issue.

So the new question is, what's my best bet? I assume I have two options. One is to build a new server and import all of the guests to it, then wipe the pvn2 server and set it up fresh, create a cluster, and set up my replications for the safety net, then add a virtual pvn3 to keep quorum intact.

The other option, if there's an easy way, would be to break the good server (pvn2) out of the cluster or free it from the failed quorum, then set up a new machine and either join it to that cluster or put the two of them into a new one. And finally, again, a virtual pvn3.

Any thoughts? If the option is to move guests to a new server, is there an easy way to do that? I'm searching the docs now...
 
Set the expected votes to 1 so the remaining node becomes quorate again (this does not disable corosync):
Code:
pvecm expected 1

Set up a third node or, for cheap, a QDevice on a Raspberry Pi.

https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
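
As a rough sketch of how that might look on the surviving node: the `run` wrapper and the function name `regain_quorum_on_survivor` are hypothetical helpers added here so the commands are only printed by default. Set `DRY_RUN=0` on a real node to execute them. As far as I know the `pvecm expected` change is a runtime setting only and is not kept across a corosync restart, so treat it as a temporary measure.

```shell
# Dry-run wrapper (hypothetical helper): prints each command unless
# DRY_RUN=0 is set explicitly, so nothing touches the cluster by accident.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

regain_quorum_on_survivor() {
    run pvecm status       # look at "Expected votes" and "Quorate" first
    run pvecm expected 1   # expect a single vote so this node is quorate
    run pvecm status       # confirm the node now reports quorum
}

regain_quorum_on_survivor
```

Once the node is quorate again, guests can be started while the replacement hardware is prepared.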
 
So I need to kill corosync on the working node and install it on a third server. Is the next step adding the third server to the cluster from the working server, and if so, does that also start the cluster back up? The doc doesn't take into account a cluster node being missing, so I don't want to make matters worse.

Code:
# apt install corosync-qdevice
# pvecm expected 1
# pvecm qdevice setup 192.168.1.3
All nodes must be online! Node pvn1 is offline, aborting.
 
So maybe there ISN'T a way to salvage what I have then?

Since I can't seem to force a quorum, is the proper route to make backups of all guests and move them to a freshly set up server?

Once I get to that point, I could wipe and set up the working one again, create a new cluster, and add the corosync dummy node on one of my Debian boxes to ensure quorum on a future failure?
 
Remove node pvn1 first.

Code:
pvecm delnode pvn1

Make sure it's shut down and never starts up again.

https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
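
Putting the steps from this thread in order, a hedged sketch might look like the following. The `run` wrapper and the function name `replace_dead_node` are hypothetical; the node name `pvn1` and the address `192.168.1.3` are the ones from the posts above. Commands are only printed unless `DRY_RUN=0` is set. This assumes the external vote host already has `corosync-qnetd` installed and the cluster nodes have `corosync-qdevice`.

```shell
# Dry-run wrapper (hypothetical helper): prints each command unless
# DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run on the surviving node. The dead node must stay powered off.
replace_dead_node() {
    local dead_node=$1 qdevice_ip=$2
    run pvecm expected 1                   # regain quorum with a single vote
    run pvecm delnode "$dead_node"         # drop the dead member from the cluster
    run pvecm status                       # check that all remaining nodes are online
    run pvecm qdevice setup "$qdevice_ip"  # add the external vote once nodes are online
}

replace_dead_node pvn1 192.168.1.3
```

Removing the dead node first is what avoids the "All nodes must be online!" abort, since the QDevice setup checks every configured member.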
 
Spot on... thank you! I now have the new server in the original cluster, and I've successfully moved some guests there and am running one. On to adding a third node just for quorum stuff.
 
If anyone is still reading here: if I have a little NUC and can throw a large enough drive in it, would it be okay as a third vote for quorum AND a replication location for backups if it has only 2 procs and 4 GB RAM? This would give the two main servers more leeway in splitting the load, but obviously I want to make sure that I've got redundancy as far as system failure goes.

Otherwise I'll just set it up as a Corosync vote for the quorum.
 
I run Proxmox Backup Server in a container with half a core and 3.5GB (for less than 150GB of backups) and it works fine, but I did have problems sometimes with less memory. Maybe just try it? And if it is too slow or something, just use it for quorum and not much else.
 