Cluster with 2 nodes, no HA, what to do if 1 node is down?

Tony

Hello,

[I apologize if my question has been answered before; I searched the forum and wiki but could not find a concise answer.]

I plan to have a cluster of 2 nodes, no HA/DRBD, latest Proxmox VE. Can someone please briefly outline the steps for the following cases?

case (1): I want to shut down one node for maintenance, for example to replace a hard drive in a RAID array or to upgrade memory. In this case, the node remains more or less identical after the downtime. From what I read, it's more complicated than just: migrate all VMs to the other node, shut down, power back on?

case (2): one node dies unexpectedly. What do I do to restore the cluster to the state before the node died, given that I have backups?


Regards,
Tony
 
I plan to have a cluster of 2 nodes, no HA/DRBD, latest Proxmox VE. Can someone please briefly outline the steps for the following cases?

Hi, I have exactly the same simple setup.

case (1): I want to shut down one node for maintenance, for example to replace a hard drive in a RAID array or to upgrade memory. In this case, the node remains more or less identical after the downtime. From what I read, it's more complicated than just: migrate all VMs to the other node, shut down, power back on?
It's just like that. I've done it many times, no trouble. If you have NAS shared storage, you can also live-move VM disks elsewhere, which helps a lot when you have to maintain the NAS (I have two).
You just have to remember quorum: if you shut down one of two nodes in this setup, the cluster loses quorum and locks (VMs keep running, but operations are locked: no backups, /etc/pve read-only, etc.).
It's a temporary safety measure that protects PVE. You can temporarily issue

# pvecm e 1

to regain quorum until the other node is up again (PVE will reset the quorum itself when it sees the second node up).
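As a sketch, the whole maintenance flow could look like this (VM ID 100 and the node names are just examples; online migration needs shared storage):

# on node1, before the maintenance window: move each VM to the other node
qm migrate 100 node2 --online
# after node1 is powered off, on node2: temporarily lower the expected votes
pvecm expected 1
# when node1 is back, verify the cluster is quorate again
pvecm status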

case (2): one node dies unexpectedly. What do I do to restore the cluster to the state before the node died, given that I have backups?

Well, it's much like above: issue

# pvecm e 1

and restore the VMs from backups on the remaining node.
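For example (a sketch; the archive path, VM ID, and storage name are illustrative):

# on the surviving node: regain quorum, then restore from a vzdump archive
pvecm expected 1
qmrestore /mnt/backup/vzdump-qemu-100.vma.lzo 100 --storage local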

If you can fix the issue on the node without reinstalling PVE on it, switch it back on and you're back at case (1).
Then you can migrate all VMs back to their original node, if you want.

If you end up reinstalling PVE on the dead node, it's similar, but you should consider that it is a totally new node to the PVE cluster.
You can do many things in this situation, but the best is probably (see the command sketch below):
- add the new node (PVE will see 3 nodes, 2 up and 1 down)
- remove the dead node (which does not exist anymore, since you reinstalled PVE on it, or it is totally new hardware)
- if you have orphan configs under /etc/pve/nodes/lostnode, you can simply delete that folder.
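In commands, roughly (the IP and node name are examples; run pvecm delnode only once the dead node is gone for good):

# on the freshly reinstalled node: join the cluster via any member's IP
pvecm add 192.168.1.10
# on an existing member: remove the dead node from the cluster
pvecm delnode lostnode
# then clean up its orphaned config folder, if present
rm -rf /etc/pve/nodes/lostnode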

You could also reinstall PVE with a config identical to the old node known to the cluster and use the -force switch to join it back, but that's not what the PVE team would suggest, as trouble may come (the SSH keys are different, the cluster config could have wrong settings, and so on, I suppose).
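For the record, that forced join would look roughly like this (the IP is an example; again, not the recommended path):

# on the reinstalled node, reusing the old name/IP; -force skips the
# "node already exists" check, with all the risks described above
pvecm add 192.168.1.10 -force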

Marco
 
Marco,

thank you very much for the *great* answer. Exactly what I was hoping for.

Regards,
Tony
 
You're welcome :) but there's more.
You have to consider that in case (2), if you simply repair your node, it could still have the old VMs on it (same IDs? or did you restore from backups with different IDs?), configured to boot automatically, VMs that you meanwhile restored to the surviving node... and that are perhaps running! You will have a conflict, and if they use the same network storage for their disks, you're in trouble.

So, generally, reinstalling nodes is better. Or at least disable those old VMs. You have to be very careful, e.g.: restore from backup with a different ID; when powering up the repaired node, first keep it offline (away from the other nodes and the shared storage); see what happens; disable automatic VM boot; etc. The best thing is probably to delete them from the repaired node and migrate the running restored copies back from the other node. You have to deal with this. And be very careful (did I already say that? :D)
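Concretely, on the repaired node that could be something like (VM ID 100 is just an example):

# make sure the stale copy cannot start on its own
qm set 100 --onboot 0
# once you are sure the restored copy on the other node is the good one,
# drop the stale local copy (careful: this also deletes its disks)
qm destroy 100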

If you have an HA setup it's generally better: it's a bit complex and must work, so it requires a lot of testing, but it automatically takes care of those situations, avoiding VM conflicts cluster-wide, restarting VMs automatically on the surviving nodes, and not restarting them if the node "resurrects" (e.g. after a solved network problem). This needs automatic quorum, which takes a strict majority of votes, so 2 nodes cannot do it; 3 is the minimum.

Marco
 
Hi Marco,

[sorry for the late reply, I could not catch up on things]

many thanks for those useful hints; much appreciated.

It seems like in case (2), restoring all VMs onto the 2nd node and reinstalling the crashed node is the simplest way. I will stick with this plan, then.

Regards,
Tony
 
You can use these directives in your cluster.conf file:

two_node="1" expected_votes="1"

These directives are useful only if your PVE cluster has exactly two nodes; then your two nodes will never lose quorum, and you can do anything you want (goodbye to the pvecm e 1 command and the collateral effects that come with it).

Re-edited: these directives are dangerous in a PVE cluster with HA and a fence device configured.
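For context, in the cman-based cluster.conf those attributes sit on the <cman> element; a minimal sketch (cluster and node names are examples, and remember to increment config_version whenever you edit the file):

<?xml version="1.0"?>
<cluster name="mycluster" config_version="2">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" votes="1" nodeid="1"/>
    <clusternode name="node2" votes="1" nodeid="2"/>
  </clusternodes>
</cluster>

With two_node="1", cman treats a single surviving vote as quorate, which is exactly why it must not be combined with HA and fencing, as noted above.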
 
